Revathy Krishnamurthy, Pavan Kapanipathi, Amit P. Sheth, and Krishnaprasad Thirunarayan. Knowledge Enabled Approach to Predict the Location of Twitter Users, Proc. of the Extended Semantic Web Conference, Slovenia, May 3, 2015.
Paper at: http://www.knoesis.org/library/resource.php?id=2039
This document summarizes an approach to entity recommendations using hierarchical knowledge bases. It proposes using spreading activation theory on Wikipedia's category hierarchy to model user interests and generate recommendations. The approach transforms Wikipedia's category structure into a hierarchy and uses spreading activation to calculate interest scores for categories. These scores are then propagated across the hierarchy to score and rank entities. The approach is evaluated on movie recommendations and shows improved recall over baseline methods, particularly when incorporating category priority weights. Future work to improve normalization and validate category priorities is discussed.
The document describes the EU Project Networking Session 2015 that was held on June 3rd 2015 in Portoroz, Slovenia. The session provided an opportunity for EU projects to connect, discuss their research, and identify opportunities for collaboration. The session included one minute "madness presentations" from various projects, a poster session to showcase projects, and thematic tables to facilitate discussions. The purpose was to enable knowledge sharing, technology transfer, and potential future collaborations between EU projects.
ESWC2015 - Tutorial on Publishing and Interlinking Linked Geospatial Data - Kostis Kyzirakos
In this tutorial we present the life cycle of linked geospatial data and we focus on two important steps: the publication of geospatial data as RDF graphs and interlinking them with each other. Given the proliferation of geospatial information on the Web many kinds of geospatial data are now becoming available as linked datasets (e.g., Google and Bing maps, user-generated geospatial content, public sector information published as open data etc.). The topic of the tutorial is related to all core research areas of the Semantic Web (e.g., semantic information extraction, transformation of data into RDF graphs, interlinking linked data etc.) since there is often a need to re-consider existing core techniques when we deal with geospatial information. Thus, it is timely to train Semantic Web researchers, especially the ones that are in the early stages of their careers, on the state of the art of this area and invite them to contribute to it.
In this tutorial we give a comprehensive background on data models, query languages, implemented systems for linked geospatial data, and we discuss recent approaches on publishing and interlinking geospatial data. The tutorial is complemented with a hands-on session that will familiarize the audience with the state-of-the-art tools in publishing and interlinking geospatial information.
http://event.cwi.nl/eswc2015-geo/
Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry - Panos Alexopoulos
Named Entity Resolution (NER) is an information extraction task that involves detecting mentions of named entities within texts and mapping them to their corresponding entities in a given knowledge resource. Systems and frameworks for performing NER have been developed by both academia and industry, with different features and capabilities. Nevertheless, what all approaches have in common is that satisfactory performance in a given scenario is not a trustworthy predictor of performance in a different one, because the scenarios differ in characteristics (target entities, input texts, domain knowledge, etc.). With that in mind, we describe a metric-based Diagnostic Framework that can be used to identify the causes behind the low performance of NER systems in industrial settings and take appropriate actions to increase it.
Linked Open Data Principles, benefits of LOD for sustainable development - Martin Kaltenböck
Presentation held on 18.09.2013 at the OKCon 2013 in Geneva, Switzerland in the course of the workshop: How Linked Open data supports Sustainable Development and Climate Change Development by Martin Kaltenböck (SWC), Florian Bauer (REEEP) and Jens Laustsen (GBPN).
Wenbo Wang defended his PhD dissertation on automatic emotion identification from text. His dissertation focused on three areas: 1) Emotion classification using machine learning techniques to identify emotions from suicide notes and tweets. 2) Creating large self-labeled emotion datasets by leveraging hashtags on Twitter. 3) Adapting emotion identification models to new domains by selecting informative tweets to add to limited labeled data in target domains like blogs. The goal was to improve identification by utilizing large Twitter datasets while addressing challenges of limited labeled data in other domains.
This document discusses using semantic technologies to address the variety challenge of big data. It provides examples of applying semantic annotation to social data and metadata. Specifically, it describes how semantic annotation can extract meaningful metadata from social media posts, including information about users, content, relationships between users, and activity networks. The document outlines different types of metadata that can be derived from social media content, users, and networks.
Best Paper Award winning paper presented at ASONAM 2015.
Derek Doran, Samir Yelne, Luisa Massari, Maria-Carla Calzarossa, LaTrelle Jackson, Glen Moriarty. Dept. of CSE, Professional Psych, Wright State University, USA; Dept. of Electrical, Computer, and Biomedical Eng., University of Pavia, Italy;
7 Cups of Tea, Inc.
http://knoesis.wright.edu/doran
Semantics Approach to Big Data and Event Processing: an introduction focused on velocity and variety
Prof Emanuele Della Valle - DEIB Politecnico di Milano
The document demonstrates how to use an ontology to semantically query a relational database using Ontop software. It shows how to create an ontology representing the database schema, map the ontology to the database, load sample data, and run SPARQL queries over the ontology to retrieve and infer additional information from the database thanks to the reasoner. The results demonstrate how ontology-based data access can provide a semantic view of relational data and enable richer querying capabilities than what is natively supported by the database.
The document discusses event processing using the Esper engine and Event Processing Language (EPL). It provides an overview of Esper's features for efficient event processing, extensibility as middleware, rich user interface, and high availability. The document then demonstrates EPL using a running example of detecting fires based on sensor data. It shows how to define event types, register queries, use various window types, and detect patterns with temporal operators.
Examples of Applied Semantic Technologies to Solve Variety Challenge of Big Data: Application of Semantic Sensor Network (SSN) Ontology
Pramod Anantharam - Kno.e.sis
Examples of Real-World Big Data Applications: specific examples of the velocity challenge and how it is addressed in a disaster coordination scenario (e.g., the Jammu & Kashmir floods).
Prof Amit Sheth - Kno.e.sis
Krishnaprasad Thirunarayan, Value-Oriented Big Data Processing with Applications, Invited Talk, The 2015 International Conference on Collaboration Technologies and Systems (CTS 2015), June 2015.
This document discusses mastering the velocity dimension of big data through information flow processing (IFP). It presents an agenda that includes an overview of IFP and a modeling framework for data stream management systems (DSMS) and complex event processing (CEP). The modeling framework consists of functional, processing, deployment, interaction, data, time, rule, and language models to characterize different systems.
This document summarizes research on analyzing the social media footprint of street gangs. The researchers collected Twitter data related to a specific Chicago gang and 10 Chicago neighborhoods. They analyzed this data using techniques like spatio-temporal analysis, network analysis, sentiment analysis and profile analysis. Their goals were to monitor gang activities, identify influential members, and evaluate community sentiment from social media posts. Challenges included automatically identifying gang members and detecting conflicts between gangs.
The Internet of Things (IoT) is set to occupy a substantial component of the future Internet. The IoT connects sensors and devices that record physical observations to applications and services of the Internet [1]. As a successor to technologies such as RFID and Wireless Sensor Networks (WSN), the IoT has stumbled into vertical silos of proprietary systems, providing little or no interoperability with similar systems. As the IoT represents the future state of the Internet, an intelligent and scalable architecture is required to provide connectivity between these silos, enabling discovery of physical sensors and interpretation of messages between the things. This paper proposes a gateway- and Semantic Web-enabled IoT architecture to provide interoperability between systems, which utilizes established communication and data standards. The Semantic Gateway as Service (SGS) allows translation between messaging protocols such as XMPP, CoAP and MQTT via a multi-protocol proxy architecture. Utilization of broadly accepted specifications such as the W3C's Semantic Sensor Network (SSN) ontology for semantic annotation of sensor data provides semantic interoperability between messages and supports semantic reasoning to obtain higher-level actionable knowledge from low-level sensor data.
Link to the paper: http://knoesis.org/library/resource.php?id=2154
Citation:
Pratikkumar Desai, Amit Sheth, Pramod Anantharam, 'Semantic Gateway as a Service architecture for IoT Interoperability', IEEE 4th International Conference on Mobile Services, June 27 - July 2, 2015, New York, USA.
Mastering the variety dimension of Big Data with semantic technologies: a high-level intro to standards, with a focus on the variety/interoperability dimension. Prof Amit Sheth
Presentation of Hexoskin Validation for KHealth's Dementia Project
The paper is available at: http://www.knoesis.org/library/resource.php?id=2155
Citation for the paper: T. Banerjee, P. Anantharam, W. L. Romine, L. Lawhorne, A. Sheth, 'Evaluating a Potential Commercial Tool for Healthcare Application for People with Dementia' in Proc. of the Intl Conf on Health Informatics and Medical Systems (HIMS), Las Vegas, July 27-30, 2015.
Wide adoption of smartphones and the availability of low-cost sensors have resulted in seamless and continuous monitoring of physiology, environment, and public health notifications. However, personalized digital health and patient empowerment can become a reality only if the complex multisensory and multimodal data is processed within the patient context. Contextual processing of patient data along with personalized medical knowledge can lead to actionable information for better and timely decisions. We present a system called kHealth capable of aggregating multisensory and multimodal data from sensors (passive sensing) and answers to questionnaires (active sensing) from patients with asthma. We present our preliminary analysis of data collected from real patients, highlighting the challenges in deploying such an application. The results show strong promise for deriving actionable information using a combination of physiological indicators from active and passive sensors that can help doctors determine more precisely the cause, severity, and control level of asthma. Information synthesized from kHealth can be used to alert patients and caregivers to seek timely clinical assistance to better manage asthma and improve their quality of life.
Paper: http://www.knoesis.org/library/resource.php?id=2153
Citation:
Pramod Anantharam, Tanvi Banerjee, Amit Sheth, Krishnaprasad Thirunarayan, Surendra Marupudi, Vaikunth Sridharan, Shalini G. Forbis, Knowledge-driven Personalized Contextual mHealth Service for Asthma Management in Children, IEEE 4th International Conference on Mobile Services, June 27 - July 2, 2015, New York, USA.
Social media provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where the citizens share information, express opinions, and engage in discussions. Often such an Online Citizen Sensor Community (CSC) has stated or implied goals related to workflows of organizational actors with defined roles and responsibilities. For example, a community of crisis response volunteers informs the prioritization of responses for resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in a CSC, there are challenges related to information overload for organizational actors, including finding reliable information providers and finding actionable information from citizens. This threatens awareness and articulation of workflows to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in both types of messages for asking and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professional) for interpreting user-generated data of citizen sensors. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues in CSC, and to allow better accessibility to user-generated data at a higher level of information abstraction for organizational actors. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes (a) identification of action-related seeking-offering intent behaviors from short, unstructured text documents using both declarative and statistical knowledge-based classification models, (b) matching of intentions about seeking and offering, and (c) engagement models of users and groups in CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic network connections in the user interaction networks. The results show a greater improvement in modeling efficiency from the fusion of top-down knowledge-driven and bottom-up data-driven approaches than from conventional bottom-up approaches alone for modeling intent and engagement. Several applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement for spreading critical information about prioritized needs, ensuring donation of only required supplies by the citizens. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
ESWC 2015 Closing and "General Chair's minute of Madness" - Fabien Gandon
The document summarizes the closing speech of the 12th European Semantic Web Conference held from May 31st to June 4th, 2015 in Portoroz, Slovenia. It recognizes award winners in various categories including best papers, challenges, and demos. It also announces details about the upcoming ESWC 2015 summer school and ESWC 2016 conference in Crete. The speaker emphasizes that semantic web technologies can effectively handle large volumes of multilingual data, optimize queries and reasoning, support new devices and applications, and enable predictive and collaborative capabilities.
Krishnaprasad Thirunarayan, Trust Management: Multimodal Data Perspective, Invited Tutorial, The 2015 International Conference on Collaboration Technologies and Systems (CTS 2015), June 2015.
As the popularity of online social networking sites such as Twitter and Facebook continues to rise, the volume of textual content generated on the web is increasing rapidly. The mining of user generated content in social media has proven effective in domains ranging from personalization and recommendation systems to crisis management. These applications stand to be further enhanced by incorporating information about the geo-position of social media users in their analysis.
Due to privacy concerns, users are largely reluctant to share their location information. As a consequence, researchers have focused on automatically inferring location information from the contents of a user's tweets. Existing approaches are purely data-driven and require large training data sets of geo-tagged tweets. Furthermore, these approaches rely solely on social media features or probabilistic language models and fail to capture the underlying semantics of the tweets.
In this thesis, we propose a novel knowledge based approach that does not require any training data. Our approach uses Wikipedia, a crowd sourced knowledge base, to extract entities that are relevant to a location. We refer to these entities as local entities. Additionally, we score the relevance of each local entity with respect to the city, using the Wikipedia Hyperlink Graph. We predict the most likely location of the user by matching the scored entities of a city and the entities mentioned by users in their tweets. We evaluate our approach on a publicly available dataset consisting of 5119 Twitter users across continental United States and show comparable accuracy to the state-of-the-art approaches. Our results demonstrate the ability to pinpoint the location of a Twitter user to a state and a city using Wikipedia, without needing to train a probabilistic model.
Invited talk at Session on Semantic Knowledge for Commodity Computing, at Microsoft Research Faculty Summit 2011, July 19-20, 2011, Redmond, WA. http://research.microsoft.com/en-us/events/fs2011/default.aspx
Associated video at: https://youtu.be/HKqpuLiMXRs
A 2016 overview of the technology & venture capital industries in Los Angeles presented by Mark Suster, Managing Partner of Upfront Ventures for the Mayor's LP / VC Summit.
Shared data infrastructures from smart cities to education - Mathieu d'Aquin
This document discusses shared data infrastructures for smart cities and education. It outlines the challenges of data heterogeneity and diversity that arise from integrating multiple datasets from different sources. It proposes taking a linked data approach to query disparate data sources virtually through templates rather than fully integrating the data into a single warehouse. This allows new data sources to be added more easily. It also advocates using semantic representations of data policies and licenses to help navigate different access conditions. Examples are provided of applications developed for the city of Milton Keynes that integrate hundreds of datasets through an "Entity API" to provide insights. Similar solutions are suggested for educational data integration and analytics.
Revealing social bot communities through coordinated behaviour - Derek Weber
Presented at the 5th Australian Social Network Analysis Conference (ASNAC) on 26 November 2020. Co-authored with Mehwish Nasim (Data61, CSIRO), Lucia Falzon (DST Group, Uni Melbourne) and Lewis Mitchell (Uni Melbourne, DST Group).
Efforts to influence public opinion online, especially during times of political relevance, such as election campaigns, have grown since first observed in 2010, and are feared to be a particular threat to the upcoming US Presidential election. A significant component of such efforts has consisted of the use of social bots to quickly disseminate vast amounts of polarizing information, propaganda and biased opinion. As social bots are intended to mimic humans on social media, it is often difficult for other humans to identify them easily, but as there are also legitimate uses for online automation, the social media platforms also struggle to contain them, especially with the vast number of users they manage. Previous research has developed methods to detect influence campaigns in general, as well as specifically focusing on identifying social bots, including examining how they interact with other accounts and influence the broader political discussion.
In this talk, we discuss preliminary results from analysis of Twitter activity over the recent 2020 Democratic and Republican National Conventions, at which the parties formally nominated their candidates for President and Vice President. Each convention ran for four days, during which we collected 3m tweets. In particular, we apply techniques for discovering highly coordinating communities based on potentially coordinated behaviours: co-retweeting, co-mentioning of hashtags, and URL sharing. In doing so, we reveal groups of accounts engaging in potentially inauthentic behaviour, and identify classes of participating accounts, including social bots, campaign accounts, news accounts, and regular Twitter users. A variety of analyses of content and temporal patterns exhibited by the communities provide qualitative and quantitative validation, along with discussion of different behaviour patterns observed between the conventions. The ultimate aim is to distinguish between legitimate use of online influence activities (e.g., by political parties and grass roots campaigns) from covert malicious ones.
The document provides an overview and analysis of leading smart city projects in the United States. It identifies Portland and Seattle as initial cities for a field trip by a Finnish delegation due to their high scores across metrics relevant to smart city development. Relevant smart city cases from Oregon and Washington are highlighted, including systems modeling in Portland, sustainability tools in Tacoma, and the Living Building Challenge framework. The document proposes broadening the field trip to include Anchorage, representing the Cascadia region of North America as a logical place to start Finnish-American smart city networking.
Wellbeing Toronto is a dynamic map visualization tool that helps evaluate community wellbeing across Toronto's 140 neighbourhoods on a number of factors including crime, transportation and housing. It's used by decision-makers who need data to support neighbourhood-level planning, residents who want information to better understand the communities they live, work, and play in, and businesses needing indicators to learn more about their customers.
But it’s more than just a map.
In this session, Wellbeing Toronto Project Manager Mat Krepicz takes you on a tour of Wellbeing Toronto, shares candid insights on its development, including key lessons learned and mistakes made, and previews what's next for one of Canada's most robust community indicator platforms.
The document provides an overview of the Imagine Austin comprehensive planning process being undertaken by the City of Austin to plan for future growth and development. It discusses Austin's past growth, the power of comprehensive plans to shape cities, what the community has said so far in the planning process, and examples of potential growth scenarios. The summary encourages community members to get involved by taking a survey or attending upcoming planning events to help create a vision and plan to guide Austin over the next 20-25 years.
This document provides an overview of geographic information systems (GIS) and mapping tools for non-profits. It discusses how maps can be used for storytelling, advocacy, program delivery, research, fundraising and community mapping. It also covers topics like data sources, tools, stakeholder participation and challenges around data acquisition. Overall, the document serves as an introduction to using maps and GIS for social causes.
Ric Rodriguez - How Information Will Transform The Search Landscape
There is a paradigm shift happening in search. Consumers expect answers to their queries and SEO has evolved past optimising specific elements to creating joined-up experiences, however and wherever users look for information. It's vital that as marketers we have an understanding of how this works - often we talk vaguely about "machine learning" and "intent", but we don't really have a clue what it all means.
This presentation is an intro into the knowledge and understanding behind search engines and other answer-based systems. It's based on my understanding and learning as I go (and expect a "true expert in machine learning" would find a number of holes, very quickly!) - but I hope it helps make the complex concepts of ML / AI / Information Retrieval a little more accessible.
For all question and queries, please reach out to me at hello@ricrodriguez.co.uk.
Discussion slides from the recent joint meeting of North Texas municipal, academic, and industry leaders discussing the potential formation of smart region collaborative.
I’ll present the new knowledge discovery tools we are building at Diffeo. Unlike traditional search engines that use keywords, Diffeo provides an in-browser knowledge base that accelerates information gathering about people, companies, chemical compounds, cyber events, or other real world entities. I’ll describe how Diffeo uses active learning to encourage long and deep user interactions in order to recommend new content for in-progress articles. As you write, the search results get better and more interesting, because the system can see more precisely which entity you mean and which you don’t (disambiguation) and also what you don’t know yet about the entity (discovery).
Finally in this presentation I’ll describe our experience organizing the Text REtrieval Conference (TREC) on Knowledge Base Acceleration (KBA) and Dynamic Domain (DD) which are pushing the state of the art in knowledge discovery on large streams. I’ll show you how to access the largest corpus of streaming text data ever released for public evaluations.
The document provides an overview of the Draft Preferred Scenario for the Plan Bay Area 2040, which establishes a 24-year regional vision for growth and investment. It summarizes the key land use and transportation strategies, which include focusing growth in Priority Development Areas, operating and maintaining the existing transportation system, and modernizing and expanding strategically. While the land use pattern meets environmental goals, it does not fully address the region's affordability issues. The Draft Preferred Scenario allocates over 90% of funds to maintenance and modernization of the existing system.
This document summarizes a risk analysis project of cultural resources within floodplains of the Snoqualmie Valley in King County, Washington. The project aims to develop a GIS model and risk analysis matrix to analyze risk exposure from natural hazards like flooding for historic properties and archaeological sites. It outlines the project objectives, methodology, database design, modeling, and mapping of results to provide a risk assessment of cultural resources.
Risk Analysis Of Cultural Resource, 4th June - guesta56b77
This document summarizes a risk analysis project of cultural resources within floodplains of the Snoqualmie Valley in King County, Washington. The project aims to develop a GIS model and risk analysis matrix to analyze risk exposure from natural hazards like flooding for historic properties and archaeological sites. It outlines the project objectives, methodology, database design, modeling, and mapping of results.
RDA: Are We There Yet?
This document discusses the progress of Resource Description and Access (RDA) since its publication in 2010. It notes recommendations from libraries that tested RDA, including rewriting instructions in plain English and improving the RDA Toolkit. The implementation date for RDA is March 31, 2013. Differences after implementing RDA include lack of abbreviations, more transcription of elements, new MARC fields, and richer authority records. Fully implementing RDA may involve changes to search options and semantic web/linked data approaches. Tips are provided for libraries on deciding when to implement, talking to vendors, and planning training.
Healthy City Community Planning and Development webinar - Healthy City
This customized webinar is for individuals working in Community Planning & Development that are interested in learning new strategies and tools to create healthier living environments in our communities. Working within a social justice framework, this webinar will demonstrate useful practices for planners utilizing the HealthyCity.org website. It will focus on how to use HealthyCity.org to promote a deeper understanding of community assets, characteristics, and the physical environment in order to inform and enhance the planning process. It will also highlight successful methods to engage community members in planning efforts, particularly around sharing local knowledge about the built environment. The webinar will also feature a guest presenter from Legal Services of Northern California to share their experience and successes using data and maps for advocacy and community building.
Similar to Knowledge Enabled Location Prediction of Twitter Users (20)
UR BHatti Academy is dedicated to providing the finest IT course training in the world. Under the guidance of experienced trainer Usman Rasheed Bhatti, we have established ourselves as a professional online training firm offering unparalleled courses in Pakistan. Our academy is a trailblazer in Dijkot, being the first institute to officially provide training to all students at their preferred schedules, led by real-world industry professionals and Google-certified staff.
STUDY ON THE DEVELOPMENT STRATEGY OF HUZHOU TOURISM - AJHSSR Journal
ABSTRACT: Huzhou has rich tourism resources and has seen considerable development since the reform and opening up; in recent years especially, Huzhou tourism has entered a new period of development opportunities. At present, Huzhou has become one of the most distinctive tourist cities on the East China tourism line. With the development of Huzhou City, the tourism industry has further improved, and the city's growing tourism profile has driven the transformation and upgrading of the industry. However, tourism development in Huzhou City still lags far behind that of the major cities in East China. This study analyzes the current development of tourism in Huzhou City, identifies its problems on the basis of that analysis, examines those problems one by one, and puts forward specific solutions to promote the further rapid development of tourism in Huzhou City.
KEYWORDS: Huzhou; Travel; Development
5. News Recommender Systems
"Beavercreek preschool to open in 2015" by Sharon D. Boykin: A $5.1 million preschool in the Beavercreek City Schools district will help accommodate a growing student population and reduce overcrowding, according to school officials.
"Ohio’s health exchange to include more competition" by Randy Tucker: It was just a year ago that the insurance industry fretted over potential losses from the new insurance market created by the Affordable Care Act.
Recommended for you
WHY IS LOCATION IMPORTANT?
• Targeted advertising
• Opinion Analysis
• Disaster Response
• Location Based Services
• Other applications
7. LOCATION PUBLISHED BY USER: Geo-tagged Tweets and Profile Information
• Less than 4% of tweets contain geo-spatial tags
• In ~4 out of 5 cases, the location field in the profile is either empty or contains invalid information such as “Justin Bieber’s heart”; even when present, it might only be at state or nation level
8. LOCATION INFERENCE
• Network based: inferred from the user's friends, followees, and followers
• Content based: inferred from the user's tweets, e.g., "Just drove around Golden Gate Park two times trying to get in" and "Cleveland Browns confuse me. When I give up on them, they actually show up to play."
9. CONTENT BASED APPROACHES
Geographic location of a user influences the contents of their tweets, e.g., "Just drove around Golden Gate Park two times trying to get in"; "Cleveland Browns confuse me. When I give up on them, they actually show up to play."
• Supervised Approaches
• Probabilistic Models – (Cheng, Caverlee, and Lee, 2010)
• Cascading Topic Models – (Eisenstein, Connor, Smith, and Xing, 2010)
• Gaussian Mixture Model – (Chang, Lee, Eltaher, and Lee, 2012)
• Language Models – (Doran, Gokhale, and Dagnino, 2014)
• Ensemble of Statistical and Heuristic Classifiers – (Mahmud, Nichols, and Drews, 2014)
10. PROBLEM STATEMENT
Predict the location of a Twitter user based on their tweets, by exploiting Wikipedia to create a location-specific knowledge base.
11. KNOWLEDGE-BASE ENABLED APPROACH
Location knowledge base, e.g., San Francisco: Golden Gate Bridge, San Francisco 49ers, San Francisco Chronicle …
Entity counts from the user's tweets: Golden Gate Bridge (4), San Francisco 49ers (2), San Francisco Chronicle (1)
Top-k predictions: San Francisco, Oakland, Palo Alto
14. WIKIPEDIA
• Collaborative encyclopedia
• As of 2014, English Wikipedia has 4.6 million articles, 18 billion page views and 500 million unique visitors per month.
• Category Structure: used for document clustering, tweet classification, personalization systems, etc.
• Link Structure: used for word sense disambiguation, semantic relatedness between terms, etc.
15. LOCAL ENTITIES
• We consider the internal links of location pages as Local Entities of the city (e.g., the Local Entities of San Francisco)
• While a city page does not contain a link to itself, we use the city itself as a local entity
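To make this concrete, here is a minimal Python sketch (not the authors' code) of collecting local entities from a city page's internal links; `get_internal_links` and `wiki_index` are hypothetical stand-ins for whatever Wikipedia dump parser or API client is actually used.

```python
# Minimal sketch (not the authors' implementation): collect the Local Entities
# of a city as the internal links of its Wikipedia page, plus the city itself.
# `wiki_index` is a hypothetical mapping from page title -> list of linked titles.

def get_internal_links(page_title, wiki_index):
    """Return the set of article titles linked from `page_title`."""
    return set(wiki_index.get(page_title, []))

def local_entities(city, wiki_index):
    """Internal links of the city page, plus the city itself
    (a page does not link to itself, but the city counts as a local entity)."""
    return get_internal_links(city, wiki_index) | {city}

# Toy example:
wiki_index = {"San Francisco": ["Golden Gate Bridge", "San Francisco 49ers",
                                "San Francisco Chronicle"]}
print(local_entities("San Francisco", wiki_index))
```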
17. ARE ALL ENTITIES EQUALLY LOCAL?
[Illustration: city-specific outlets such as the San Francisco Chronicle, San Francisco Examiner, and SF Weekly contrasted with global outlets such as CNN, BBC, and Al Jazeera America.]
18. LOCALNESS MEASURE OF ENTITIES: Association-based Measure
• Pointwise Mutual Information (PMI) is a standard measure of association between two variables
• The assumption is that the higher the localness of an entity with respect to the city, the higher the statistical dependence between them
• Computed as: pmi(le, c) = log( P(le, c) / (P(le) × P(c)) ), where le is the local entity, c is the city, P(le, c) is the joint probability of occurrence of the city and the local entity in the Wikipedia dump, and P(le) and P(c) are the individual probabilities of occurrence of the local entity and the city, respectively.
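A minimal sketch of the association-based measure, assuming the occurrence and co-occurrence counts have already been extracted from the Wikipedia dump (the counting itself is omitted; the function and parameter names are illustrative, not from the paper):

```python
import math

def pmi_localness(cooccur, count_le, count_c, total):
    """Association-based localness of a local entity `le` w.r.t. a city `c`:
    pmi(le, c) = log( P(le, c) / (P(le) * P(c)) ).

    cooccur  - co-occurrences of the entity and the city in the Wikipedia dump
    count_le - occurrences of the local entity
    count_c  - occurrences of the city
    total    - total number of contexts used to estimate the probabilities
    """
    if cooccur == 0:
        return float("-inf")  # never co-occur: no statistical association
    p_le_c = cooccur / total
    p_le = count_le / total
    p_c = count_c / total
    return math.log(p_le_c / (p_le * p_c))

# e.g. pmi_localness(cooccur=120, count_le=300, count_c=5000, total=1_000_000)
```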
19. LOCALNESS MEASURE OF ENTITIES: Graph-based Measure
[Illustration: excerpts from the Wikipedia pages for Boston, Boston Red Sox, and American League ("The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901..", "The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts ...", "They are members of American League (AL)."), whose mutual hyperlinks form the graph used for the graph-based measure.]
20. LOCALNESS MEASURE OF ENTITIES: Graph-based Measure
• Betweenness Centrality (BC) measures the importance of a node relative to the rest of the nodes in the graph
• A high BC score of a vertex indicates that it lies on a considerable fraction of the shortest paths connecting other vertices
• Computed as: BC(le) = Σ_{lei ≠ le ≠ lej} σ_{lei,lej}(le) / σ_{lei,lej}, where lei, lej, and le are local entities of c, σ_{lei,lej} is the total number of shortest paths from lei to lej, and σ_{lei,lej}(le) is the number of those paths that pass through le.
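As an illustration of the graph-based measure, the hyperlink graph over a city's local entities can be scored with an off-the-shelf betweenness centrality implementation such as NetworkX's; the toy edge list below stands in for links mined from the Wikipedia dump, and the use of an undirected graph is an assumption of this sketch.

```python
import networkx as nx

# Toy hyperlink graph over local entities (edges = Wikipedia links between them);
# treating the graph as undirected is an assumption of this sketch.
edges = [("Boston", "Boston Red Sox"),
         ("Boston Red Sox", "American League"),
         ("Boston", "Fenway Park"),
         ("Fenway Park", "Boston Red Sox")]

G = nx.Graph()
G.add_edges_from(edges)

# Betweenness centrality: the fraction of shortest paths between other
# local entities that pass through each entity.
bc = nx.betweenness_centrality(G, normalized=True)
for entity, score in sorted(bc.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {score:.3f}")
```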
21. LOCALNESS MEASURE OF ENTITIES: Semantic Overlap Measure
[Illustration: three groups of entities. Internal links specific to the San Francisco page (Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …); internal links specific to a local entity's page (Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange, …); and internal links shared by both (San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area, …).]
22. LOCALNESS MEASURE OF ENTITIES – Semantic Overlap Measure
• Measures the relatedness between concepts, with the intuition that related concepts are connected to similar entities
• Jaccard Index: overlap between two sets (see the sketch below)
ji(le, c) = |IL(c) ∩ IL(le)| / |IL(c) ∪ IL(le)|
where IL(c) and IL(le) are the internal links found in the Wikipedia pages of the city c and the local entity le
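A minimal sketch of the Jaccard overlap between the internal-link sets of a city and a local entity; the two link sets are made up for illustration:

```python
def jaccard(links_city: set, links_entity: set) -> float:
    """Jaccard index between the internal-link sets of a city and a local entity."""
    if not links_city and not links_entity:
        return 0.0
    return len(links_city & links_entity) / len(links_city | links_entity)

il_sf = {"San Francisco Bay", "Golden Gate", "Marin County", "Alcatraz Island", "Market Street"}
il_ggb = {"San Francisco Bay", "Golden Gate", "Marin County", "Art Deco", "Irving Morrow"}
print(jaccard(il_sf, il_ggb))  # 3 shared links out of 7 distinct ones ≈ 0.43
```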
23. LOCALNESS MEASURE OF ENTITIES – Semantic Overlap Measure
• Tversky Index: an asymmetric similarity measure between two sets (see the sketch below)
ti(le, c) = |IL(c) ∩ IL(le)| / ( |IL(c) ∩ IL(le)| + α·|IL(c) − IL(le)| + β·|IL(le) − IL(c)| )
where IL(c) and IL(le) are the internal links found in the Wikipedia pages of the city c and the local entity le
• We choose α = 0 and β = 1
• The local entity is penalized for every entity in its page that is not found in the page of the city
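A corresponding sketch of the Tversky index; with the paper's choice α = 0 and β = 1 it reduces to |IL(c) ∩ IL(le)| / |IL(le)|, i.e. the fraction of the entity's links that also appear on the city's page (the link sets are again made up):

```python
def tversky(links_city: set, links_entity: set, alpha: float = 0.0, beta: float = 1.0) -> float:
    """Asymmetric Tversky similarity of a local entity with respect to a city."""
    common = len(links_city & links_entity)
    only_city = len(links_city - links_entity)
    only_entity = len(links_entity - links_city)
    denom = common + alpha * only_city + beta * only_entity
    return common / denom if denom else 0.0

il_sf = {"San Francisco Bay", "Golden Gate", "Marin County", "Alcatraz Island", "Market Street"}
il_ggb = {"San Francisco Bay", "Golden Gate", "Marin County", "Art Deco", "Irving Morrow"}
print(tversky(il_sf, il_ggb))  # 3 / (3 + 2) = 0.6
```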
25. CREATION OF USER PROFILE
• Step 1: Entity Linking – identify entities in tweets and link them to their Wikipedia articles, e.g. "Just drove around Golden Gate Park trying to get in." → Golden Gate Park
• We use Zemanta for Entity Linking
26. CREATION OF USER PROFILE
• Step 1: Entity Linking – e.g. "Just drove around Golden Gate Park trying to get in." (we use Zemanta for Entity Linking)
• Step 2: Entity Scoring – score each linked entity by the frequency of its occurrence in the user's tweets (see the sketch below), e.g. Golden Gate Bridge: 4, San Francisco 49ers: 2, San Francisco Chronicle: 1
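A minimal sketch of Steps 1–2 with a hypothetical `link_entities` function standing in for the Zemanta entity linker (the lookup table inside it is hard-coded purely for illustration):

```python
from collections import Counter
from typing import Iterable, List

def link_entities(tweet: str) -> List[str]:
    """Stand-in for an entity linker such as Zemanta: returns the Wikipedia
    entities mentioned in a tweet (hard-coded here for illustration)."""
    lookup = {
        "Golden Gate Park": ["Golden Gate Park"],
        "Cleveland Browns": ["Cleveland Browns"],
    }
    return [e for phrase, ents in lookup.items() if phrase in tweet for e in ents]

def build_profile(tweets: Iterable[str]) -> Counter:
    """User profile: Wikipedia entities found in the tweets, with their frequencies."""
    profile = Counter()
    for tweet in tweets:
        profile.update(link_entities(tweet))
    return profile

tweets = [
    "Just drove around Golden Gate Park two times trying to get in",
    "Cleveland Browns confuse me. When I give up on them, they actually show up to play.",
]
print(build_profile(tweets))  # Counter({'Golden Gate Park': 1, 'Cleveland Browns': 1})
```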
28. LOCATION PREDICTION
• Compute an aggregate score for each city whose local entities are found in a user's tweets (see the sketch below):
locScore(u, c) = Σ_{e ∈ LE_c^u} locl(e, c) × s_e
where LE_c^u is the set of local entities of c found in the profile of user u, locl(e, c) is the localness measure of the entity e with respect to city c, and s_e is the frequency-based score of e in the user's profile
• Rank cities by locScore(u, c) in descending order to predict the top-k locations of a user
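A minimal sketch of the aggregation and ranking step; the nested dictionary of localness scores and the example profile are made-up stand-ins for the knowledge base and the user profile:

```python
from collections import Counter
from typing import Dict, List, Tuple

def predict_locations(profile: Counter,
                      localness: Dict[str, Dict[str, float]],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Rank cities by the sum of localness(e, c) * frequency(e) over the
    user's entities; localness[city][entity] holds the locl(e, c) scores."""
    scores = {}
    for city, entity_scores in localness.items():
        score = sum(entity_scores[e] * count
                    for e, count in profile.items() if e in entity_scores)
        if score > 0:
            scores[city] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

# Illustrative, made-up localness scores and profile counts
localness = {
    "San Francisco": {"Golden Gate Park": 0.17, "Nob Hill": 0.48, "SF Weekly": 0.19},
    "Oakland, CA": {"Fox Oakland Theatre": 0.09, "Green Day": 0.02},
}
profile = Counter({"Golden Gate Park": 1, "Nob Hill": 3, "Green Day": 1})
print(predict_locations(profile, localness))
```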
29. LOCATION PREDICTION – Example
• User profile (entity and count): San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1), Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1), Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1), …
• Knowledge base localness scores (examples): Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, PARC (company) 0.03726, Google 0.04678, Facebook 0.05810, …
• Location predictions (aggregate scores):
• San Francisco – 14.5531, from San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1)
• Oakland, CA – 10.7584, from Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1)
• Detroit, MI – 8.0600, from The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1)
• Palo Alto, CA – 6.9175, from Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1)
30. IMPLEMENTATION
• Knowledge base
• All cities of the United States with a population > 5,000, as published in the 2012 census estimates
• 4,661 cities and 500,714 local entities
• Baseline
• Considers all local entities to be equally local to the city
• Location prediction based only on the frequency of entities
31. EVALUATION – Test Dataset
• Published by Cheng et al.
• Collected from September 2009 to January 2010
• Contains 5,119 active users from the continental United States, with approximately 1,000 tweets per user
• Each user's location is listed as a latitude/longitude pair
32. EVALUATION – Evaluation Metrics
• Error Distance – distance between the actual location of the user and the estimated location (a great-circle distance; see the sketch below)
• Average Error Distance (AED) – average of the error distances of all users in the test dataset
• Accuracy (ACC) – percentage of users predicted within 100 miles of their actual location
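The speaker notes state that the error distance is computed with the haversine formula; a minimal sketch in miles:

```python
import math

def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance in miles between two lat/long points."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# San Francisco to Oakland: roughly 8 miles
print(haversine_miles(37.7749, -122.4194, 37.8044, -122.2712))
```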
34. EVALUATION
Localness Measure   ACC (%)   AED (miles)   ACC@2   ACC@3   ACC@5
Baseline            25.21     632.56        38.01   42.78   47.95
PMI                 38.48     599.40        49.85   56.06   64.15
BC                  47.91     478.14        57.39   62.18   66.98
Jaccard Index       53.21     433.62        67.41   73.56   78.84
Tversky Index       54.48     429.00        68.72   74.68   79.99
• PMI is not normalized and is therefore sensitive to the count of occurrences of local entities in the Wikipedia corpus
• E.g. the PMI scores of the local entities of Glen Rock, New Jersey are higher than those of San Francisco
35. EVALUATION
• Betweenness Centrality does a good job of assigning low scores to common entities
• E.g. community college, National Weather Service, start-up company, etc.
• It fails, however, for entities with some relevance to the city but no distinguishing factor
• E.g. IBM with respect to Endicott, New York
37. EVALUATION
• The Jaccard Index underperforms for local entities whose pages contain far fewer internal links than the city's page
• E.g. Eureka Valley and California with respect to San Francisco
39. EVALUATION
• The Tversky Index is the best performing localness measure
• It overcomes the disadvantage of the Jaccard Index
• For example, we are able to assign a higher localness to Eureka Valley (0.7096) than to California (0.1270) with respect to San Francisco
40. EVALUATION – Comparison with Existing Approaches
Method                                ACC (%)   AED (miles)
Cheng, Caverlee, and Lee, 2010        51.00     535.56
Chang, Lee, Eltaher, and Lee, 2012    49.90     509.30
Wikipedia-based Approach              54.48     429.00
41. CONCLUSION
• Presented a crowd-sourced, knowledge-based approach to predicting the location of a user that does not require geo-tagged tweets as training data
• Introduced the concept of Local Entities and preprocessed the Wikipedia hyperlink graph to extract the local entities of each city
• Investigated relatedness measures to establish the degree of association between a local entity and a city
• Evaluated the proposed approach against the benchmark dataset published by Cheng et al.; for 5,119 users, we are able to predict the location of 55% of users within 100 miles, with an average error distance of 429 miles
42. FUTURE WORK
• Compute a confidence score for each prediction based on the top-k cities and the count of local entities in tweets
• Investigate other localness measures for scoring local entities
• Consider the semantic types and categories of local entities, and weight their contributions by type
• Explore other knowledge bases such as Wikitravel and GeoNames
48. EVALUATION – Top 100 Cities
• 2,172 users in the dataset are from the 100 most populated cities of the United States
• 60% of these users are predicted within 100 miles of their actual location
• 54% are predicted exactly at the city level
Editor's Notes
“City of Lights” – nickname of Paris. When we see a text like this, we use that background knowledge we have to interpret the text. Similarly, background knowledge can be used to improve a machine’s ability to understand and interpret text.
Background knowledge is central to the idea of semantic web. Domain specific knowledgebases such as MusicBrainz, UMLS, Dbpedia etc have been used to solve problems such as named entity recognition and disambiguation, relevant document retrieval, data analysis etc. Today I am going to talk about applying background knowledge to predict the geographic location of a user. In this context, the location of a user is their home location at the city level.
Social media has grown rapidly. Many applications have been developed based on Twitter
Associating geographic information with Twitter users can provide value addition to many applications
Location provides “context”
Users can volunteer geographic information through cellphone or profile
Cheng et al. found that only 21% of users listed a location as granular as city or state in their profile
There was a need to automatically infer the location of a user
Network based approach uses training data to determine hidden patterns in the communications between a user and his friends. These observations are used to predict the location of a user
Geographic location of a user influences the contents of their tweets
The general idea behind these approaches is to determine the probabilistic distribution of words across a region. Different approaches such as language models, topic models, Gaussian mixture model and ensemble based classifiers have been used for the task of location prediction.
We address these weaknesses by using a knowledge-based approach to extract location specific concepts from Wikipedia.
Content based prediction of the location of a Twitter user by exploiting Wikipedia to create a location specific knowledge-base
This approach consists of three modules – A knowledge base generator that extracts location specific concepts from a crowd sourced knowledgebase. A User profile generator that creates a semantic profile of a user whose location is to be determined and a Location Prediction module that uses the semantic profile and the knowledgebase to predict the top-k locations of a user.
Knowledgebase generator extracts location specific concepts from Wikipedia.
Local Entities: Entities that have a high relatedness to a city and can discriminate between geographic locations
Wikipedia is publicly available. Anyone can edit a Wikipedia article, correct errors and compensate for any biased views.
Wikipedia has been used as background knowledge to solve many problems. At Kno.e.sis, Wikipedia has been used in many applications. Doozer used Wikipedia to create domain specific ontologies, Blooms used Wikipedia for Ontology Alignment and the hierarchical interest graph for a personalization system, was also based on the category structure of Wikipedia.
So we hypothesize that for each location, the internal links in its Wikipedia page represent entities that are relevant to it and we consider them as Local Entities of the city. In other words, the local entities of a city are all the outgoing links from its Wikipedia page.
For example, consider this snippet from the Wikipedia page of San Francisco. The San Francisco Chronicle is a major daily newspaper in San Francisco. Clearly it is more local to San Francisco than CNN or MSNBC, which are national media outlets.
So now our goal is to score each entity with respect to the city such that the score reflects the localness measure of the entity with respect to the city.
We experiment with measures of three different classes to score the localness of an entity with respect to a city. That is, association based measure, graph based measure and semantic overlap based measure.
Association based measure: Compute relatedness based on their occurrences in a large corpus
PMI is a measure of how much the actual probability of their occurrence differs from what is expected based on the probabilities of their individual occurrences.
Wrt San Francisco:
CNN: 7.94
San Francisco Chronicle: 10.41
Construct a directed graph of local entities for each city
Ulrik Brandes
BC used to compute the importance of an actor in a social network. The importance is measured in terms of shortest paths. For each node, the BC is the sum of the fraction of shortest paths that pass through that node.
So BC is helping us determine which nodes are important in the network of local entities of a city.
Based on the idea that the higher the overlap between the concepts found in the Wikipedia pages of a city and an entity, the higher the degree of localness of the entity
Jaccard Index is a symmetric measure to compute the semantic overlap between a city and an entity.
But we find that a local entity generally represents a part of the city. For example, Golden Gate Bridge will not completely overlap with all the concepts of San Francisco which contains entities from different categories like Climate, History, Geography etc.
With that in mind, we use Tversky Index which is an asymmetric similarity measure.
Tversky Index is a unidirectional measure of similarity of the local entity with respect to the city.
We penalize the local entity for every concept in its Wikipedia page that is not present in the city
The next module is the user profile generator. We create a semantic profile of each user consisting of Wikipedia entities found in their tweets
We create a semantic profile of each user consisting of Wikipedia entities found in their tweets.
First step in the creation of User Profile is Entity Linking: Identification of entities from tweets and linking them to their Wikipedia article.
Zemanta for Entity Linking
The next step is Entity Scoring i.e. Scoring each local entity in a user’s tweet using frequency of their occurrence.
We are predicting the home location of a user based on a set of their historic tweets.
Using the frequency helps to determine how relevant the entity is with respect to the user.
For example: a football fan may tweet about many football teams, chances are that he will tweet more frequently about the football team of his city
Brief description of the approach
For predicting the location of a user we compute an aggregate score for each city whose local entities are found in a user’s tweets.
We first created a knowledgebase of locations that contains location specific entities along with a score that represents the localness of the entity with respect to the city. Next, for a user whose location is to be predicted we create a profile that consists of entities mentioned in their tweets and their frequency. Finally, we predict the top-k locations of a user.
We evaluated our approach on a test dataset published by Cheng et al. Their test dataset consists of 5119 active users from continental United States with approximately 1000 tweets per user. These users are spread across 569 cities in US. Spammers and bots are cleaned from the dataset to ensure a clean dataset.
We compute the error distance as the great-circle distance between a pair of latitude/longitude coordinates using the haversine formula
PMI – not normalized, sensitive to count of occurrences
We find that betweenness centrality does a good job of assigning low scores to common entities such as the National Weather Service but it fails for entities which have some relevance to the city but no distinguishing factor.
Endicott, New York is the birthplace of IBM. So the Wikipedia page of Endicott has an entire section dedicated to IBM, and this section contains entities very specific to IBM, e.g. punched cards and T. J. Watson. Because the shortest paths between these IBM-specific nodes and the other nodes of the city pass through IBM, the betweenness centrality of IBM is very high.
In such cases, local entities scored using betweenness centrality lead to incorrect predictions.
The Jaccard Index underperforms for local entities whose pages contain fewer internal links than the city's page.
Example: Eureka Valley is a residential neighbourhood of San Francisco so we would expect it to be more local to SFO than California.
With Tversky, the localness of a local entity only diminishes for entities in its page not present in the city.
Introduction to Twitter and location of a Twitter user
We find that the more local entities we find in a user's tweets, the better we can predict their location. This is quite intuitive: if you tweet more and give more clues about your location, there is a higher possibility of locating you. On the other hand, if your tweets are specific to a certain topic such as technology, or to national or world politics, then it is difficult to find evidence that can be used to locate you accurately.
As shown in this graph, the predictions made on the basis of 10 or more entities were able to locate 66% of the users within 100 miles.
From this dataset we selected users from the top-100 most populated cities of US and found that our algorithm was able to locate 60% of the users within 100 miles of their actual location and 54% users at exactly the city level.