The document describes an entity linking approach to generate a personalized timeline of historic events for a user. It involves four main parts: (1) fetching candidate historic events from DBpedia, (2) generating a user profile from information extracted from the user's Facebook profile, (3) matching the candidate events to the interests in the user's profile, and (4) scoring and ranking the events to produce the final personalized timeline. The approach uses entity linking techniques to associate mentions of entities in the user's profile with the corresponding entries in a knowledge base, in order to identify the user's interests.
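The matching, scoring, and ranking steps can be sketched as a simple overlap-based ranker (a minimal illustration with hypothetical names; the actual system's features and scoring are more involved):

```python
from dataclasses import dataclass


@dataclass
class Event:
    title: str
    year: int
    entities: set  # KB entities the event mentions


def score_event(event, interests):
    # Jaccard-style overlap between the event's entities and the
    # user's linked interests.
    if not event.entities:
        return 0.0
    return len(event.entities & interests) / len(event.entities | interests)


def personalized_timeline(events, interests, k=10):
    # Rank candidate events by interest overlap, then order the
    # top-k chronologically to form the timeline.
    ranked = sorted(events, key=lambda e: score_event(e, interests), reverse=True)
    return sorted(ranked[:k], key=lambda e: e.year)
```

Events with no entity overlap fall to the bottom; the top-k cut keeps the timeline focused on the user's interests.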
This document discusses understanding email traffic patterns through recipient recommendation. It explores using social network analysis and language models of email content to predict likely recipients of a given email. Specifically, it examines using measures of node importance in the network, strength of connections between nodes, and similarity between language models of communication profiles to rank and select recipient nodes. The findings indicate that combining social network analysis and language modeling performs better than either approach individually, and that language model similarity is most important for interpersonal communication, while network metrics are more informative for highly active users. Recipient recommendation could help with applications like anomaly detection in e-discovery.
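The combination of network strength and language-model similarity can be sketched as a linear mixture (hypothetical names and a simplified cosine similarity; the original work's models are richer):

```python
import math


def lm_similarity(profile_a, profile_b):
    # Cosine similarity between two unigram term-frequency
    # profiles (dicts of word -> count).
    dot = sum(profile_a.get(w, 0) * c for w, c in profile_b.items())
    na = math.sqrt(sum(c * c for c in profile_a.values()))
    nb = math.sqrt(sum(c * c for c in profile_b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_recipients(email_terms, candidates, alpha=0.5):
    # candidates: {name: (connection_strength, term_profile)}.
    # Mix normalized connection strength with language-model
    # similarity using weight alpha.
    max_strength = max(s for s, _ in candidates.values()) or 1
    scored = {
        name: alpha * (s / max_strength)
              + (1 - alpha) * lm_similarity(email_terms, prof)
        for name, (s, prof) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Setting alpha per sender would reflect the finding above: lower alpha (more language-model weight) for interpersonal communication, higher alpha for highly active users.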
Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams – David Graus
The manual curation of knowledge bases is a bottleneck in fast-paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base, in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation for this task. Finally, we show how our method significantly outperforms a lexical-matching baseline by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and the textual quality of input documents.
This document discusses adding semantic structure to real-time social data from Twitter through Twitter Annotations. It describes how Annotations can be mapped to existing Semantic Web vocabularies and linked to datasets to enable real-time semantic search over social and linked data. A system called TwitLogic is presented that captures Twitter data, converts it to RDF, and publishes it as linked streams to allow for continuous querying and integration with the live Semantic Web.
The Gulf Tower Project aims to use data from Instagram photos to determine the mood of Pittsburgh and display it through light colors on the Gulf Tower. Photos are analyzed using sentiment analysis to assign scores and categorize them as positive, negative, or neutral. Over 16,000 photos were collected, with most being positive. The colors mapped to emotions will light up the Gulf Tower to visually show Pittsburgh's mood. Challenges included API limits, technical issues, and ensuring the story was understandable. The goal is to use technology to represent community feelings from social media data in an artistic display.
This document provides an overview of the evolution of information and technology over time. It begins with ancient symbols and manuscripts, then discusses the development of the telegraph, telephone, radio, and early computers. It outlines the creation of the internet and the world wide web, and how they led to an explosion of information sharing. The document discusses challenges of information overload and different search technologies and services that have been developed to help users find relevant information. It promotes the use of the BOSS API and other tools to build custom search applications and solutions.
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via – OpenSource Connections
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mappings changes out to production. As a team we now also have a vocabulary for talking about relevance and can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
ServerSide Javascript on Freebase - SF JavaScript meetup #9 – Will Moffat
This document summarizes a presentation about Acre, which allows running server-side JavaScript on Freebase.com. It introduces Freebase as a topic database containing over 11 million topics. It describes MQL, the JSON query language used to query Freebase. Examples are provided to find Russian cosmonauts and tropical storms from the 1990s. The presentation also discusses hosting apps on FreebaseApps.com and using Acre's templating language and widgets like suggestion.
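MQL's query-by-example style can be illustrated with a small sketch (the type path below is illustrative and not verified against the historical Freebase schema):

```python
import json

# An MQL query is a JSON skeleton: null slots ask the service
# to fill in values, while concrete values act as constraints.
query = [{
    "type": "/spaceflight/astronaut",  # illustrative type path
    "name": None,                      # null -> "return this"
    "limit": 5,
}]

# The query is wrapped in a JSON envelope before being sent.
payload = json.dumps({"query": query})
```

The same skeleton pattern extends to nested objects, so queries can follow links between topics without any joins.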
Hacking the Newsroom
This is the slide deck for a presentation I gave this year at FlashBelt, Flash on the Beach, and Flash on Tap. Despite the names of these conferences, the presentation has nothing to do with Flash.
Here is the session description:
"In February, the New York Times announced that it was giving away the keys to 28 years of data - news stories, movie reviews, obituaries, and political statistics - all for free. Whether the dying gasp of a legendary institution, or the beginnings of an extraordinary rebirth, the release of this vast and historically significant information is a boon to data visualizers, entrepreneurs, social scientists and artists around the world.
In this session, Jer will show a variety of work that he has produced using data from The New York Times and The Guardian newspapers. He'll show how to access this information easily in Flash and Processing, and will share code samples to get you started in explorations of your own. Along the way, he'll attempt to examine how a new era of open data is affecting science, art, and design."
The document summarizes a presentation on linking Civil War data using Linked Open Data techniques. It discusses:
1. The potential for mashups and remixing of cultural heritage data in a linked data context.
2. The growth of the Linked Open Data cloud and importance of linking data from libraries, archives, and museums.
3. The Civil War Data 150 project which aims to link datasets about the American Civil War using a common vocabulary to enable new analyses and visualizations of the data.
This presentation was given by Georges Oates (Flickr) at the seminar Nationaal Archief joins Flickr the Commons on 4 November 2008 in Rotterdam. This project is part of the Dutch digitization project Images for the Future, www.imagesforthefuture.org.
The NoTube BeanCounter: Aggregating User Data for Television Programme Recomm... – MODUL Technology GmbH
The document describes The NoTube BeanCounter, which aggregates user data for television programme recommendation. It aligns and enriches various data sources like EPG data, viewer logs, social media profiles, and metadata from sources like IMDB to build user profiles. These profiles are then used by the BeanCounter to provide personalized recommendations of television programmes and series to users based on their preferences and those of their social connections. A prototype recommendation system called iZapper was also developed.
Data Science - The Most Profitable Movie Characteristic – Cheah Eng Soon
The document analyzes movie data from the TMDB 5000 Movie Dataset to understand characteristics of profitable movies. It explores relationships between genres, movie types and profits over time. The data is cleaned by merging movie and credits datasets, selecting relevant columns, and filling in missing values for release date and runtime. Three main issues are studied: how genres change over time, the relationship between type and profit, and comparisons between production companies.
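The cleaning and profit-analysis steps described above can be sketched with the standard library (the original analysis presumably uses pandas; the rows and numbers here are made up):

```python
from statistics import median

# Hypothetical rows mirroring a few TMDB-style columns.
movies = [
    {"title": "A", "genres": ["Action"], "budget": 100, "revenue": 350, "runtime": 110},
    {"title": "B", "genres": ["Drama"],  "budget": 40,  "revenue": 60,  "runtime": None},
    {"title": "C", "genres": ["Action"], "budget": 80,  "revenue": 50,  "runtime": 95},
]

# Fill missing runtimes with the median of the known values,
# mirroring the missing-value step described above.
known = [m["runtime"] for m in movies if m["runtime"] is not None]
for m in movies:
    if m["runtime"] is None:
        m["runtime"] = median(known)

# Profit per movie, then mean profit per genre.
profit_by_genre = {}
for m in movies:
    profit = m["revenue"] - m["budget"]
    for g in m["genres"]:
        profit_by_genre.setdefault(g, []).append(profit)

avg_profit = {g: sum(p) / len(p) for g, p in profit_by_genre.items()}
```

Grouping by release year instead of genre gives the "genres over time" view in the same few lines.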
This document provides a brief history of the development of computer networks from 1945 to present day. It mentions early pioneers like Vannevar Bush who envisioned associative networks of information called memex in 1945. Ted Nelson coined the term "hypertext" in 1965 to describe interconnected multimedia. The small world experiment in 1967 and Tim Berners-Lee's invention of the World Wide Web in 1991 were important early network developments. More recent research includes scale-free networks and crowdsourcing as networks have grown exponentially with user generated content and open collaboration online.
This document discusses how maps can be used to enrich math tasks. It provides examples of using maps to measure the height of structures by comparing shadows, calculating speed from distance and time measurements in a "speed trap" activity, and exploring zoom levels and dilation factors in map images. Other tips mentioned include finding creative reference sources and measuring objects in multiple ways to improve accuracy. Potential map websites for similar activities are also listed.
R, Data Wrangling & Kaggle Data Science Competitions – Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my pycon tutorial + a few updates
It is a work in progress; I will update it with daily snapshots until it is done.
GRASS GIS, Star Trek and old Video Tape – a reference case on audiovisual pre... – Peter Löwe
This presentation showcases new options for the preservation of audiovisual content in the OSGeo communities beyond the established software repositories or YouTube. Audiovisual content related to OSGeo projects, such as training videos and screencasts, can be preserved by advanced multimedia archiving and retrieval services which are currently being developed by the library community. This is demonstrated by the reference case of a newly discovered high-resolution version of the GRASS GIS 1987 promotional video, which has been made available in the AV-Portal of the German National Library of Science and Technology (TIB). The portal allows for extended search capabilities based on enhanced metadata derived by automated video analysis. This is a reference case for future preservation activities regarding semantically enhanced Web 2.0 content from OSGeo projects.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... – Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
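Part of what makes Parquet so compact is its columnar layout combined with encodings such as dictionary encoding; here is a toy pure-Python illustration of the idea (not the actual Parquet format):

```python
def dictionary_encode(column):
    # Parquet-style dictionary encoding: store each distinct
    # value once, plus a compact list of integer indexes.
    dictionary, indexes = [], []
    positions = {}
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


# Row-oriented data flattened into a single column, as a
# columnar store would lay it out on disk.
station = ["KATL", "KATL", "KJFK", "KATL", "KJFK"]
dictionary, indexes = dictionary_encode(station)
```

Low-cardinality columns like these collapse to a tiny dictionary plus small integers, which is also why columnar scans decode so cheaply on the CPU side.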
This document contains information from a Twitter engineering presentation about the Twitter API and core objects like users, timelines, tweets, and the social graph. It includes examples of user and tweet JSON structures, as well as screenshots and links to documentation, code samples, and visualizations related to analyzing tweets and trends on Twitter. The presentation encourages attendees to explore the Twitter API and contact Twitter engineers with any other questions.
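The tweet JSON structures mentioned above can be handled with a standard JSON parser; the payload below follows the shape of the classic REST API tweet object, trimmed to a few fields:

```python
import json

# A trimmed tweet object in the shape of the classic REST API
# payload (only a handful of its many fields).
raw = '''{
  "id": 123,
  "text": "Exploring the #TwitterAPI",
  "user": {"id": 42, "screen_name": "example_dev"},
  "entities": {"hashtags": [{"text": "TwitterAPI"}]}
}'''

tweet = json.loads(raw)
author = tweet["user"]["screen_name"]
tags = [h["text"] for h in tweet["entities"]["hashtags"]]
```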
Freebase - Semantic Technologies 2010 Code Camp – Jamie Taylor
Freebase is a socially managed, semantic database that provides a rich set of APIs for accessing a wide range of data about the world around us. Getting started with Freebase is quick and easy - there are no API keys and you can make up to 100k queries a day as long as you follow the Creative Commons Attribution license.
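A minimal sketch of issuing such a keyless query: the MQL read service took the query as a URL-encoded JSON parameter (Freebase has since been retired, so the historical endpoint below is an assumption and no longer resolves):

```python
import json
from urllib.parse import urlencode

# Historical MQL read endpoint (service now retired; shown only
# to illustrate how a request was constructed).
ENDPOINT = "https://api.freebase.com/api/service/mqlread"

query = {"query": {"id": "/en/the_beatles", "name": None, "type": []}}
url = ENDPOINT + "?" + urlencode({"query": json.dumps(query)})
```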
Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Histor... – Ian Milligan
This was the second part of a joint presentation I did with Jimmy Lin (Maryland) at the "Web Archiving Collaboration: New Tools and Models" conference at Columbia University, New York NY on 4 June 2015.
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Histori... – Ian Milligan
This is the second part of a joint presentation I did with Jimmy Lin (University of Maryland) at the "Web Archiving Collaboration: New Tools and Models" conference at Columbia University, New York NY on 4 June 2015.
The document provides an overview of resources for researching animation at the library, including reference books, databases, journals, and websites. It discusses searching the catalog and databases for books and articles on animation history, techniques, studios, films, and animators. Specific animators, films, and resources mentioned include Norman McLaren, Blinkity Blank, the National Film Board of Canada, and Eadweard Muybridge's photographs of galloping horses. Evaluation criteria and sample relevant websites focusing on animation are also listed.
Micha L. Rieser: How GLAM can support Wikipedians – Beat Estermann
The document discusses how galleries, libraries, archives, and museums (GLAM) can support Wikipedians by providing images and information from their collections. It notes that GLAM institutions often restrict photography and require lengthy permission processes, creating barriers for Wikipedians. The document proposes solutions like GLAM uploading high-quality images under free licenses, communicating directly with Wikipedians, and designating staff as open knowledge experts.
Augmenting RDBMS with MongoDB for ecommerce – Steven Francia
Steve Francia, VP of Engineering at OpenSky, a NYC-based social commerce company, on how OpenSky augments its RDBMS with MongoDB to develop the next ecommerce platform.
OpenSky combines traditional SQL solutions with NoSQL to overcome the limitations of each, increase development speed, and scale quickly.
Pragmatic ethical and fair AI for data scientists – David Graus
1. David Graus presented on pragmatic and fair AI for recruitment and news recommendations.
2. He discussed how algorithms can unintentionally learn and reflect human biases around gender and race. However, AI may also help address these biases, such as through representational ranking in recruitment to achieve demographic parity.
3. Graus also explored using editorial values like diversity, dynamism and serendipity to guide news recommendations, and found their system could increase dynamism without loss of accuracy through constrained intervention.
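Representational re-ranking toward demographic parity, as mentioned in point 2, can be sketched with a greedy pass (a simplified illustration with hypothetical group quotas; the talk's actual method may differ):

```python
def representational_rerank(candidates, groups, quota):
    # Greedily re-rank a relevance-ordered candidate list so that,
    # at every prefix, each group's share stays near its quota
    # (a simple take on demographic-parity-style re-ranking).
    result, counts = [], {g: 0 for g in quota}
    remaining = list(candidates)
    while remaining:
        def deficit(c):
            # How far the candidate's group is below its target
            # share for the next prefix length.
            g = groups[c]
            target = quota[g] * (len(result) + 1)
            return target - counts[g]
        # Prefer the group most below target; break ties by the
        # original (relevance) order.
        best = max(remaining, key=lambda c: (deficit(c), -remaining.index(c)))
        remaining.remove(best)
        counts[groups[best]] += 1
        result.append(best)
    return result
```

On a relevance ranking of three "M" candidates followed by one "F" candidate with 50/50 quotas, the "F" candidate is promoted to the second slot while relative order within each group is preserved.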
Slidedeck of my lecture at SIKS Course "Advances in Information Retrieval"
Read more here: https://graus.nu/blog/bias-in-recommendations-lecture-siks-course-on-advances-in-ir/
Similar to yourHistory - entity linking for a personalized timeline of historic events
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity – David Graus
The document summarizes research on recommender systems in the media industry. It discusses how FD Mediagroup uses recommender systems for their SMART Radio and SMART Journalism products. Key aspects of building a recommender system that FD focuses on include relevance, usefulness, and trust. Relevance is evaluated using metrics like NDCG, MAP, and R-Precision. Usefulness considers both algorithmic goals like diversity and business goals. Trust is evaluated based on whether users engage with the recommender system.
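NDCG, one of the relevance metrics named above, can be computed in a few lines (standard log2 discount; graded relevance labels are assumed):

```python
import math


def dcg(relevances):
    # Discounted cumulative gain with the standard log2 discount:
    # rank 1 divides by log2(2), rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (sorted) ordering, so a
    # perfect ranking scores exactly 1.0.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg else 0.0
```

A ranking that buries a highly relevant item scores strictly below 1.0, which is what makes the metric useful for comparing rankers offline.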
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy – David Graus
Lecture given at the VOGIN-IP-lezing on 28 March 2018 at the Openbare Bibliotheek Amsterdam.
DISCLAIMER: this talk is a fine bit of old-fashioned (human) manipulation: an expert showing up with five or so recommendations :-).
"These days, technology companies that collect user behaviour data at scale are increasingly eyed with suspicion. In this talk I explain why leveraging user behaviour matters, and how it is used to unlock information effectively and make it searchable, whether for a search engine like Google, which has to find its way through a web of billions of pages, or a service like Spotify, which wants to keep serving its users the right music."
Layman's Talk: Entities of Interest --- Discovery in Digital Traces. David Graus
The document outlines a program that includes a committee grilling a speaker at 10:00, the committee retreating afterwards, a ceremony at 10:15, and a reception downstairs from 11:00 to 12:30.
Slides of the talk I gave at PyData Amsterdam.
Abstract:
"The FD Mediagroep collects, analyses and filters valuable and relevant information, 24/7, for an influential group of professionals, business executives and high net worth individuals. Company.info (part of FDMG) provides complete, reliable, up-to-date company information and business news about no less than 2.7 million companies and other legal entities in the Netherlands. For Company.info we continuously monitor and crawl hundreds of (online) news sources, resulting in a large archive of (Dutch) business-related news, spanning hundreds of thousands of articles. These articles are automatically enriched, by linking the profiles of companies that are mentioned in the articles, using a custom in-house entity linking framework built in Python. In this talk, I will briefly explain the entity linking task, I will detail the implementation of our custom entity linking framework, and our pipeline for crawling and enriching news articles."
De Macht van Data --- Hoe algoritmen ons leven vormgeven (The Power of Data: How algorithms shape our lives). David Graus
Slides of the introductory talk I gave at an event at De Balie: "De macht van data" on June 18th, 2017.
For a video recording of the talk see: http://graus.co/blog/mini-college-algoritmen/
Talk I gave at the Data Science Northeast Netherlands Meetup, where I detail the custom in-house entity linking framework, sentiment analysis, and entity salience scoring model we developed for Company.info, in addition to showing some example applications of our corpus of news articles linked to organization profiles.
Dynamic Collective Entity Representations for Entity Ranking. David Graus
This document proposes using collective intelligence to dynamically enrich entity representations from multiple sources like knowledge bases, anchors, tags, and tweets. It presents an adaptive ranking model that learns optimal weights for ranking features like field similarity and importance over time. An experiment on query logs shows expanding entities with different sources improves ranking and retraining the ranker with new content further enhances performance.
Dynamic Collective Entity Representations for Entity Ranking. David Graus
This document proposes using dynamic collective entity representations to improve entity ranking. It describes enriching static entity representations from knowledge bases with descriptions from dynamic sources like tweets, queries, and tags. An adaptive ranking model individually weights each description source and retrains over time using clicks. Experimental results show expanding representations and retraining the ranker improves ranking performance compared to a non-adaptive model, with different sources providing varying benefits depending on their dynamic nature and entity coverage.
David Graus presents his research on using semantic search techniques to improve information retrieval for digital forensic evidence from emails and other electronic documents. He discusses using social network analysis of communication patterns and language models of email content to predict likely recipients of emails. By combining these approaches, he is able to more accurately rank potential recipients than using either technique alone. Future work includes incorporating organizational structure and decay of communication patterns over time.
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th. David Graus
David Graus from the University of Amsterdam gave a presentation on entity linking at the Search Engines Amsterdam conference on June 27th. He began by defining entity linking as linking mentions of entities in text to their corresponding entities in a knowledge base. He then gave an example of entity linking and discussed ranking entity candidates based on their prior probabilities like link probability and commonness. Finally, he described using both local and global features in supervised learning models to improve entity linking accuracy.
This document discusses research on applying text mining and information retrieval techniques for fact finding in regulatory investigations from electronic documents. The researchers are developing methods for semantic search in e-discovery to iteratively retrieve relevant evidence from emails, forums, and other sources by integrating structural context and extracting knowledge from unstructured text. Their current work includes using Twitter mining as a form of conversational search and entity linking to semantically enrich documents.
Semantic Annotation of the Cyttron Database. David Graus
Final Presentation for my MSc Graduation Project.
Abstract:
"Semantic annotation uses human knowledge formalized in ontologies to enrich texts, by providing structured and machine-understandable information of its content. This paper proposes an approach for automatically annotating texts of the Cyttron Scientific Image Database, using the NCI Thesaurus ontology. Several frequency-based keyword extraction algorithms were implemented and evaluated, aiming to extract important concepts and exclude less relevant ones. Furthermore, topic classification algorithms were applied to identify important concepts which do not occur in the text. The algorithms were evaluated by comparison to annotations provided by experts. Semantic networks were generated from these annotations and an ontology-based similarity metric was applied to perform the comparison. Finally the networks were visualized to provide further insights into the differences of the semantic structure generated by humans, and the algorithms."
More information: http://graus.nu/category/thesis
yourHistory - entity linking for a personalized timeline of historic events
1. [Timeline visualization: historic events (Gaza War, Britches, World War II, Berlin Wall, Woodstock, 9/11, Gulf War, BET Hiphop Awards) plotted on an axis from 1900 to 2010]
David Graus, Maria-Hendrike Peetz, Daan Odijk, Maarten de Rijke, Ork de Rooij
2. Entity Linking for a personalized timeline of historic events
• Motivation
• Method
• Part I: Fetch Candidate Historic Events
• Part II: Generate User Profile
• Part III: Matching Events to User Profile
• Part IV: Scoring & Ranking Events
• Future Work
3. • "[…] To design and build innovative and robust prototypes and demos for tools that analyse and/or integrate open web data for educational purposes."
21. Access Facebook profile
MY FACEBOOK PROFILE: BIO, POSTS, LIKES
{
  "id": "1183880085",
  "likes": {
    "data": [
      { "category": "Musician/band", "created_time": "2013-10-27T11:37:51+0000", "name": "NAS", "id": "113591595350795" },
      { "category": "Company", "created_time": "2013-10-17T07:45:36+0000", "name": "Infinibase", "id": "573216229380347" },
      { "category": "Magazine", "created_time": "2013-10-04T13:55:10+0000", "name": "New Scientist NL", "id": "369158433181445" },
22. Extract text attributes
{
  "id": "1183880085",
  "likes": {
    "data": [
      { "category": "Musician/band", "created_time": "2013-10-27T11:37:51+0000", "name": "NAS", "id": "113591595350795" },
      { "category": "Company", "created_time": "2013-10-17T07:45:36+0000", "name": "Infinibase", "id": "573216229380347" },
      { "category": "Magazine", "created_time": "2013-10-04T13:55:10+0000", "name": "New Scientist NL", "id": "369158433181445" },
      { "category": "Tv show", "created_time": "2010-05-09T01:06:27+0000", "name": "The Wire", "id": "5991693871" }
    ]
  }
}
[Extracted text attributes:] Story, Omroep Maxim, Gamer01, Breaking Bad, AT5, Mad Men, The Wire, Monty Python's Flying Circus, Flight of the Conchords, Donnie Darko, Flevopark Film Festival, Do The Right Thing, A Clockwork Orange, Wild Style, Princess Mononoke, The Fountain, Pi, Northfork, La Haine, Zen and the Art of Motorcycle Maintenance, Moon Palace, The Fountainhead, The Wind-Up Bird Chronicle, Wu-Tang, J.Cole, NAS, Pusha T, ASAP Rocky, Ab-Soul, Chance The Rapper, Cannibal Ox, Bonobo, Aesop Rock, Boards Of Canada, Jurassic 5, GREMS, Quasimoto, Strange Journey Volume Three, Drop, Velvet, MODESELEKTOR, IAM, Derek, The Onion, Imgur, De Speld, Wu-Tang
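The extraction step above can be sketched in Python. A minimal sketch, assuming the Graph API response shape shown on the slide; the `extract_text_attributes` helper is illustrative, not the deck's actual code:

```python
import json

# Hypothetical excerpt mirroring the Graph API response shown on the slide.
raw = """
{
  "id": "1183880085",
  "likes": {
    "data": [
      {"category": "Musician/band", "created_time": "2013-10-27T11:37:51+0000",
       "name": "NAS", "id": "113591595350795"},
      {"category": "Company", "created_time": "2013-10-17T07:45:36+0000",
       "name": "Infinibase", "id": "573216229380347"},
      {"category": "Magazine", "created_time": "2013-10-04T13:55:10+0000",
       "name": "New Scientist NL", "id": "369158433181445"}
    ]
  }
}
"""

def extract_text_attributes(profile_json):
    """Collect the textual attributes of a profile: here, the names of liked pages."""
    profile = json.loads(profile_json)
    return [like["name"] for like in profile.get("likes", {}).get("data", [])]

print(extract_text_attributes(raw))  # ['NAS', 'Infinibase', 'New Scientist NL']
```

The same traversal would apply to bio and post fields; only the JSON path changes.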
23. [Extracted text attributes, continued:] ASAP Rocky, Ab-Soul, Chance The Rapper, Cannibal Ox, Bonobo, Aesop Rock, Boards Of Canada, Jurassic 5, GREMS, Quasimoto, Strange Journey Volume Three, Drop, Velvet, MODESELEKTOR, IAM, Derek, The Onion, Imgur, De Speld, Wu-Tang, J.Cole, I Am Fucking Ambivalent About Science, NAS, Pusha T, ASAP Rocky, Chrietitie, Infinibase, Marktplaatspoëzie, Jeannette Span: Spelen
24. Entity Linking
• Given a Knowledge Base
• Link mentions of entities (or concepts) to their referent entities
25. Entity Linking
• From Wikipedia: extract anchor texts (words used to link to Wikipedia pages)
• For each n-gram n ↔ Wikipedia page W, estimate:
• the probability of using n-gram n to refer to Wikipedia page W
26. Entity Linking Example
Link Probability
"Nas" occurs 2,475x in Wikipedia:
• is anchor: 1,723x
• is no anchor: 752x
27. Entity Linking Example
Link Probability
"Nas" occurs 2,475x in Wikipedia:
• is anchor: 1723/2475 = 69.6%
• is no anchor: 752/2475 = 30.4%
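The link-probability computation above is a single ratio, anchor occurrences over total occurrences. A minimal sketch (function name is illustrative):

```python
def link_probability(anchor_count, total_count):
    """Fraction of a phrase's occurrences in Wikipedia that appear as anchor text."""
    return anchor_count / total_count

# "Nas" occurs 2475x in Wikipedia: 1723x as anchor text, 752x as plain text.
print(f"{link_probability(1723, 2475):.1%}")  # 69.6%
```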
28. Entity Linking Example
Commonness
• "Nas" is used to refer to:
• http://en.wikipedia.org/wiki/Nas
• http://en.wikipedia.org/wiki/Naas
• http://en.wikipedia.org/wiki/Nås
• http://en.wikipedia.org/wiki/Nas (Ikaria)
• http://en.wikipedia.org/wiki/Untitled Nas album
29. Entity Linking Example
Commonness
• "Nas" is used to refer to:
• http://en.wikipedia.org/wiki/Nas 14x
• http://en.wikipedia.org/wiki/Naas 4x
• http://en.wikipedia.org/wiki/Nås 3x
• http://en.wikipedia.org/wiki/Nas (Ikaria) 2x
• http://en.wikipedia.org/wiki/Untitled Nas album 2x
30. Entity Linking Example
Commonness
• "Nas" is used to refer to:
• http://en.wikipedia.org/wiki/Nas 14/25 = 56%
• http://en.wikipedia.org/wiki/Naas 4/25 = 16%
• http://en.wikipedia.org/wiki/Nås 3/25 = 12%
• http://en.wikipedia.org/wiki/Nas (Ikaria) 2/25 = 8%
• http://en.wikipedia.org/wiki/Untitled Nas album 2/25 = 8%
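Commonness follows the same pattern per candidate target: each page's share of all 25 links for the mention "Nas" (so 14/25 = 56%, 4/25 = 16%, and so on). A minimal sketch, with the counts taken from the slides:

```python
from collections import Counter

# How often anchors with the text "Nas" point at each candidate page (from the slides).
link_targets = Counter({
    "Nas": 14,
    "Naas": 4,
    "Nås": 3,
    "Nas (Ikaria)": 2,
    "Untitled Nas album": 2,
})

def commonness(targets):
    """Commonness of each candidate: its link count over all links for the mention."""
    total = sum(targets.values())  # 25
    return {page: count / total for page, count in targets.items()}

for page, p in commonness(link_targets).items():
    print(f"{page}: {p:.0%}")
```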
32. [List of attributes shown on the slide:] AT5, Mad Men, The Wire, Monty Python's Flying Circus, Flight of the Conchords, Donnie Darko, Flevopark Film Festival, Do The Right Thing, A Clockwork Orange, Wild Style, Princess Mononoke, The Fountain, Pi, Northfork, La Haine, Zen and the Art of Motorcycle Maintenance, Moon Palace, The Fountainhead, The Wind-Up Bird Chronicle, Wu-Tang, J.Cole
53. Future Work
• Log interactions
• Interpret clicks as (implicit) feedback:
  • Click on Event: user is interested
  • No click on Event: user is not
• Learn scoring & ranking functions
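The click-feedback idea above could be prototyped with a simple update rule. This is purely illustrative; `update_interest` and the learning rate are assumptions, not the learned scoring and ranking functions the slide refers to:

```python
def update_interest(score, clicked, lr=0.1):
    """Move an interest score toward 1.0 on a click, toward 0.0 on a skip
    (simple exponential smoothing; a stand-in for a learned function)."""
    target = 1.0 if clicked else 0.0
    return score + lr * (target - score)

score = 0.5
score = update_interest(score, clicked=True)   # 0.55
score = update_interest(score, clicked=False)  # 0.495
print(round(score, 3))
```

Events whose types accumulate higher scores would then be ranked higher in the next timeline.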
54. Thank you! Questions?
Try yourHistory: http://apps.facebook.com/yourHistory
See our poster: #98
David Graus
d.p.graus@uva.nl
@dvdgrs