Video of the talk: https://www.youtube.com/watch?v=7k-u_TUew3o
Abstract: Social media has experienced immense growth in recent times. These platforms are increasingly used for information seeking and consumption, and with their growing popularity, information overload poses a significant challenge to users. For instance, Twitter alone generates around 500 million tweets per day, and it is impractical for users to parse through such an enormous stream to find information that interests them. This situation necessitates efficient personalized filtering mechanisms that let users consume relevant, interesting information from social media.
Building a personalized filtering system involves understanding users' interests and utilizing those interests to deliver relevant information. These tasks primarily involve analyzing and processing social media text, which is challenging due to its short length and the real-time nature of the medium. The challenges include: (1) Lack of semantic context: social media posts are, on average, short, which provides limited semantic context for textual analysis. This is particularly detrimental to topic identification, a necessary task for mining users' interests. (2) Dynamically changing vocabulary: most social media sites, such as Twitter and Facebook, generate posts that are of current (timely) interest to users. Due to this real-time nature, information relevant to dynamic topics of interest evolves to reflect changes in the real world. This in turn changes the vocabulary associated with these dynamic topics, making it harder to filter relevant information. (3) Scalability: the number of users on social media platforms is very large, making it difficult for centralized systems to scale to deliver relevant information to every user. This dissertation is devoted to exploring semantic techniques and Semantic Web technologies to address the above challenges in building a personalized information filtering system for social media. In particular, the necessary semantics (knowledge) is derived from crowd-sourced knowledge bases such as Wikipedia to improve the context available for understanding short text and dynamic topics on social media.
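The idea of enriching short posts with knowledge-base semantics can be sketched as follows. This is a minimal illustration, not the dissertation's actual method: the tiny term-to-Wikipedia-category dictionary below is a hypothetical stand-in for a real lookup against a crowd-sourced knowledge base, and the pseudo-term expansion simply gives a downstream topic model more context than the raw tokens alone.

```python
from collections import Counter

# Hypothetical excerpt of a Wikipedia-derived term -> category dictionary.
CONCEPT_MAP = {
    "messi": ["FC Barcelona players", "Association football"],
    "goal": ["Association football terminology"],
    "senate": ["Government of the United States", "Politics"],
}

def enrich(post: str) -> Counter:
    """Augment a short post's bag of words with knowledge-base categories,
    giving topic identification more semantic context to work with."""
    tokens = [t.strip(".,!?").lower() for t in post.split()]
    bag = Counter(tokens)
    for tok in tokens:
        for category in CONCEPT_MAP.get(tok, []):
            bag[category] += 1  # add each linked category as a pseudo-term
    return bag

bag = enrich("What a goal by Messi!")
```

A topic model run over the enriched bag can now see that a five-word tweet is about association football, which the surface tokens alone barely reveal.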
Social media provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where citizens share information, express opinions, and engage in discussions. Often such an Online Citizen Sensor Community (CSC) has stated or implied goals related to the workflows of organizational actors with defined roles and responsibilities: for example, a community of crisis response volunteers that informs the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, CSCs present challenges of information overload for organizational actors, including finding reliable information providers and finding actionable information from citizens. This threatens the awareness and articulation of workflows needed for cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., 'wanna help' appearing both in messages asking for help and in messages offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professionals) for interpreting the user-generated data of citizen sensors. Interdisciplinary research involving the social and computer sciences is essential to address these socio-technical issues in CSCs and to make user-generated data accessible at a higher level of information abstraction for organizational actors. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes: (a) identification of action-related seeking and offering intent behaviors from short, unstructured text documents, using a classification model based on both declarative and statistical knowledge; (b) matching of seeking and offering intentions; and (c) engagement models of users and groups in the CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic connections in user interaction networks. The results show an improvement in modeling efficiency from the fusion of top-down knowledge-driven and bottom-up data-driven approaches over conventional bottom-up approaches alone for modeling intent and engagement. Applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement, spreading critical information about prioritized needs so that citizens donate only the supplies that are required. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
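The fusion of declarative rules with a statistical fallback for seeking/offering intent can be sketched as below. This is an illustrative toy, not the IME framework's actual classifier: the regex patterns are hypothetical examples of declarative knowledge, and the "unknown" branch marks where a trained statistical model would take over for messages the rules cannot decide.

```python
import re

# Hypothetical declarative rules for the two intent classes.
SEEK_RULES = [r"\bneed(s|ed)?\b", r"\blooking for\b", r"\bwhere can i\b"]
OFFER_RULES = [r"\bdonat(e|ing)\b", r"\bcan provide\b", r"\bwilling to give\b"]

def classify_intent(text: str) -> str:
    """Rule-first intent classification; ties fall through to a
    statistical model (not implemented in this sketch)."""
    t = text.lower()
    seek = sum(bool(re.search(p, t)) for p in SEEK_RULES)
    offer = sum(bool(re.search(p, t)) for p in OFFER_RULES)
    if seek > offer:
        return "seeking"
    if offer > seek:
        return "offering"
    return "unknown"  # defer to the data-driven classifier
```

An ambiguous phrase like 'wanna help' matches neither rule set, which is exactly the kind of case the bottom-up statistical component is needed for.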
ESS Digital Sociology Conference presentation.
I provide an overview of methodological opportunities, challenges, and solutions to consider for sociologists who are thinking about delving into the world of online ethnography.
Netnography: Overview and How to (Schulich School of Business, MBA class, Soc...) by elpinchito
This is the slide deck used for the 'Netnography: Overview & How-to' presentation on Feb. 15, 2012. The presentation (watch the YouTube video below) was part of the class assignments for the "Social Media Marketing" class taught by Robert Kozinets at the Schulich School of Business, York University. The presentation explores topics such as why netnography is useful for marketing research and what researchers have to keep in mind, with some specific examples.
The video on the first slide is a teaser for this presentation.
The link to the recorded presentation: https://www.youtube.com/watch?v=UWApBu2ERTU&context=C31c1b83ADOEgsToPDskJO-DQt8ZUtzIA-tdvMiOHd
Understanding Public Sentiment: Conducting a Related-Tags Content Network Ext... by Shalin Hai-Jew
This presentation focuses on how to understand public sentiment through a related-tags content network analysis of public Flickr photos and videos. NodeXL is used to conduct data extractions and visualizations of user-tagged Flickr contents and the resulting “noisy” folksonomies. What mental connections may be made about particular issues based on analysis of text-annotated graphs?
Exploring Machine Learning for Libraries and Archives: Present and Future, by Bohyun Kim
A conference presentation given by Bohyun Kim, Chief Technology Officer and Professor, University of Rhode Island Libraries, USA, at the Bite-sized Internet Librarian International 2021 on September 22, 2021.
Eavesdropping on the Twitter Microblogging Site, by Shalin Hai-Jew
Research analysts go to Twitter to capture the general trends of public conversations, identify and profile influential accounts, and extract subgroups within larger collectives and larger discourses; they also go to eavesdrop on individual self-talk and individual-to-individual conversations. "So what is technically in your tweets?" Dave Rosenberg famously asked in a CNET article (2010). The answer: a whole lot more than 140 characters. How are the most influential social media accounts identified through #hashtag graphs? How are themes extracted? How are sentiments understood? How can users be profiled through their tweetstreams? How can locations be mapped in terms of the Twitter conversations occurring in particular physical areas? How can live and trending issues be identified and categorized in terms of sentiment (positive, negative, and neutral)? This presentation summarizes some of the free and open-source tools, as well as commercial and proprietary ones, that enable increased knowability.
Objectives: 1. Gain an understanding of key trends in ICT innovation which are influencing/disrupting crisis informatics. 2. Be able to trace these trends through discussions later this semester, and understand their influence and potential. 3. Introduce visualization lab
In Netnography, online observations and interactions are valued as a cultural reflection that yields deep human understanding. Like in Ethnography, Netnography is naturalistic, immersive, descriptive, intuitive, adaptable, and focused on context.
Social Data and Multimedia Analytics for News and Events Applications, by Yiannis Kompatsiaris
The keynote discusses a framework enabling real-time multimedia indexing and search across multiple social media sources. It places particular emphasis on the real-time, social and contextual nature of content and information consumption in order to integrate topic and event detection, mining, search and retrieval, based on aggregation and indexing of shared user-generated multimedia content. User-friendly applications for the News and Events domains have been developed based on these approaches, incorporating novel user-centric media visualisation and browsing methods. The research and development is part of the FP7 EU project SocialSensor.
Content:
Introduction
Motivation – Challenges
SocialSensor Project and Use Cases
Research Approaches
Large-Scale visual search
Clustering
Verification
Demos – Applications
MM News Demo
Clusttour
Thessfest
Conclusions
Virtual Ethnography: Bridging the Gap between Market Research and Social Media, by Alterian
While there have been many different applications of social media data in the marketing field, one that is not well known, but arguably the most interesting, is Virtual Ethnography.
Virtual Ethnography is the process of conducting and constructing an ethnography using the virtual, online environment as the site of the research. With Virtual Ethnography, a market researcher can study a community online to gather insights within the context of marketing strategies and/or initiatives.
John Song and Jen Kersey share their insights into Virtual Ethnography and illustrate them with a case study of the beloved marshmallow candy Peeps. The findings are both entertaining and quite insightful from a marketing perspective.
Video: https://www.youtube.com/watch?v=ZCToaDgxnAs
Abstract:
People's emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships, and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. This has valuable implications for studies of suicide prevention, employee productivity, people's well-being, customer relationship management, etc. However, emotion identification is quite challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes an emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, which makes it difficult to glean emotions purely from keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small and fail to provide comprehensive coverage of emotion-triggering events and situations.
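The self-labeling idea described in the abstract can be sketched as follows. This is an illustration under assumptions: the hashtag-to-emotion map is a small hypothetical subset, and real pipelines would add far more filtering, but the core trick is harvesting the author's own emotion hashtag as a label and stripping it from the text so a classifier cannot trivially memorize it.

```python
# Hypothetical excerpt of a hashtag -> emotion label map.
HASHTAG_EMOTION = {"#happy": "joy", "#angry": "anger", "#scared": "fear"}

def self_label(posts):
    """Turn posts ending in an emotion hashtag into (text, label) pairs,
    removing the hashtag so the label is not leaked into the features."""
    pairs = []
    for post in posts:
        for tag, emotion in HASHTAG_EMOTION.items():
            if post.lower().endswith(tag):
                text = post[: -len(tag)].strip()
                pairs.append((text, emotion))
    return pairs

data = self_label(["Got the job today! #happy", "Stuck in traffic again #angry"])
```

The resulting pairs can then feed any standard text classifier, giving large training sets without manual annotation.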
Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth, Advisor; T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY
ABSTRACT:
Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive use and flexibility of language, the creativity of human beings, and the social, cultural, and economic changes of daily life have added new constructs, styles, and features to the language. One such feature is its ability to express ideas, opinions, and facts implicitly. This feature is used extensively in day-to-day communication, for example when: 1) expressing sarcasm, 2) trying to recall forgotten things, 3) conveying descriptive information, 4) emphasizing the features of an entity, and 5) communicating a common understanding.
Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasteron was prescribed.' The tweet contains an implicit mention of the entity Gravity, and the clinical text snippet contains an implicit mention of the relationship between the medication Dolasteron and the clinical condition nausea. Such implicit references to entities and relationships are common in daily communication, and they add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from tweets.
This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution that is capable of extracting implicit factual information from text. The developed solution starts by acquiring the knowledge relevant to the implicit information extraction problem: domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms, such as a text snippet, structured knowledge represented in standard knowledge representation languages like the Resource Description Framework (RDF), or custom formats. The acquired knowledge is therefore processed to create models that machines can understand. Such models provide the infrastructure to perform the implicit information extraction of interest.
This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in these use cases. They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.
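The gist of implicit entity linking can be sketched with a toy overlap score. This is a simplification under assumptions: each candidate entity is represented by a hand-built set of descriptive terms standing in for the knowledge models the dissertation derives from domain and contextual knowledge, and a sentence with no explicit mention is linked to the entity whose model it overlaps most.

```python
# Hypothetical term-set "knowledge models" for two candidate entities.
ENTITY_MODELS = {
    "Gravity (film)": {"sandra", "bullock", "astronaut", "space", "movie"},
    "Titanic (film)": {"ship", "iceberg", "movie", "dicaprio"},
}

def link_implicit(sentence: str) -> str:
    """Link a sentence with no explicit entity mention to the candidate
    entity whose descriptive term set best overlaps the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    return max(ENTITY_MODELS, key=lambda e: len(ENTITY_MODELS[e] & words))

entity = link_implicit("New Sandra Bullock astronaut lost in space movie looks terrifying")
```

Real systems weight terms and exploit context rather than counting raw overlaps, but the structure (entity models plus a scoring function) is the same.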
Understanding users’ latent intents behind search queries is essential for satisfying their search needs. Search intent mining can help search engines enhance the ranking of search results, enabling new search features like instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, increasing attention has been paid to mining search intents effectively by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of a query (e.g., sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of impact on the user, and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach for domain-specific search intent mining, with a focus on health-related search queries.
First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies, analyses of popular health websites, literature surveys, and an empirical study of search queries. We defined the problem of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification tasks (namely, problem transformation and algorithm adaptation methods) were not feasible due to the limitations of label data creation and health-domain constraints. Another challenge in solving the search intent identification problem was mapping terms used by laypeople to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging the rich background knowledge encoded in the Unified Medical Language System (UMLS) and a crowd-sourced encyclopedia (Wikipedia). The approach identifies search intent in a disease-agnostic manner and has been evaluated on three major diseases.
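The two-step shape of this approach (normalize lay vocabulary, then apply intent rules for multi-label output) can be sketched as below. Everything here is an illustrative assumption: the lay-to-medical map is a toy stand-in for UMLS/Wikipedia lookups, and the trigger lists and class names are invented examples rather than the dissertation's 14 actual classes.

```python
# Toy stand-ins for the knowledge-base lookups and intent rules.
LAY_TO_MEDICAL = {"sugar": "glucose", "heart attack": "myocardial infarction"}
INTENT_TRIGGERS = {
    "symptoms": ["symptom", "signs of"],
    "treatment": ["treat", "cure", "medication for"],
    "cause": ["cause", "why do i"],
}

def mine_intents(query: str):
    """Normalize consumer vocabulary to medical terms, then assign zero
    or more intent classes via trigger rules (multi-label output)."""
    q = query.lower()
    for lay, med in LAY_TO_MEDICAL.items():
        q = q.replace(lay, med)
    labels = {cls for cls, trigs in INTENT_TRIGGERS.items()
              if any(t in q for t in trigs)}
    return q, sorted(labels)

normalized, labels = mine_intents("signs of heart attack and how to treat it")
```

Note that a single query can legitimately receive several labels, which is why the problem is framed as multi-label rather than forcing one class per query.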
While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, including public platforms like Twitter. Although Twitter is an excellent information source, identifying informative tweets in the deluge of tweets is the major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study comparing health search intents, and the features that contribute to the expression of search intent, across 100+ million search queries from smart devices (smartphones/tablets) and personal computers (desktops/laptops).
There is a rapid intertwining of sensors and mobile devices into the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. A system of sensing and computational components embedded in the physical world is termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations into CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to performing a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS systems. We propose a generic framework for moving from observations to decision-making and actions in PCS systems, consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework for dealing with the uncertainty, complexity, and dynamism involved in translating observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that accurately reflect real-world dependencies. To overcome this limitation, we propose to empower PGMs with declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction; (b) Bayesian network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding; (c) knowledge-driven piecewise linear approximation of nonlinear time series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding; and (d) transformation of knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
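Technique (a), creating CRF training data automatically from domain knowledge, can be sketched as distant supervision: match a dictionary of known entities against raw text and emit BIO-tagged token sequences. The entity dictionary below is a hypothetical traffic-domain example, not the dissertation's actual knowledge source.

```python
# Hypothetical domain-knowledge dictionary: surface form -> entity type.
ENTITY_DICT = {"i-75": "ROAD", "accident": "EVENT", "rush hour": "TIME"}

def auto_label(sentence: str):
    """Emit (token, BIO-tag) pairs by longest-match dictionary lookup,
    producing CRF training data without manual annotation."""
    tokens = sentence.lower().split()
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for span in (2, 1):  # try two-token phrases before single tokens
            phrase = " ".join(tokens[i:i + span])
            if phrase in ENTITY_DICT:
                tags[i] = "B-" + ENTITY_DICT[phrase]
                for j in range(i + 1, i + span):
                    tags[j] = "I-" + ENTITY_DICT[phrase]
                i += span
                break
        else:
            i += 1
    return list(zip(tokens, tags))

labeled = auto_label("accident on i-75 during rush hour")
```

Labeled sequences generated this way are noisy, but at scale they let a CRF learn contextual patterns that generalize beyond the dictionary itself.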
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
Dissertation Defense:
"Mining and Analyzing Subjective Experiences in User Generated Content"
By Lu Chen
Tuesday, April 9, 2016
Dissertation Committee: Dr. Amit Sheth (Advisor), Dr. T. K. Prasad, Dr. Keke Chen, Dr. Ingmar Weber, and Dr. Justin Martineau
Pictures: https://www.facebook.com/Kno.e.sis/photos/?tab=album&album_id=1225911137443732
Video: https://youtu.be/tzLEUB-hggQ
Lu's Home page: http://knoesis.wright.edu/researchers/luchen/
ABSTRACT
Web 2.0 and social media enable people to create, share, and discover information instantly, anywhere, anytime. A great amount of this information is subjective information -- information about people's subjective experiences, ranging from feelings about what is happening in our daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies to support decision making in areas such as product purchases, marketing strategy, and policy making. However, much useful subjective information is buried in the ever-growing user-generated data on social media platforms, and it remains difficult to extract high-quality subjective information and make full use of it with current technologies.
Current subjectivity and sentiment analysis research has largely focused on classifying text polarity -- whether the expressed opinion regarding a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information, such as emotion, intent, and preference, which may prevent their exploitation from reaching its full potential. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We have identified four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation makes contributions in developing novel and general techniques for the tasks of identifying and extracting these components.
We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus. Instead of associating an overall sentiment with a given text, this method assesses the more fine-grained, target-dependent polarity of each sentiment expression. Unlike pattern-based approaches, which often fail to capture the diversity of sentiment expressions due to the informal nature of language usage and writing style in social media posts, the proposed approach is capable of identifying sentiment phrases.
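A toy sketch of the underlying intuition (not the dissertation's actual optimization formulation): candidate expressions that co-occur with the target alongside positive or negative seed words inherit a target-dependent polarity score, which lets slang terms be picked up without labeled data. The posts, seed lists, and scoring rule below are invented assumptions.

```python
# Toy sketch of target-dependent polarity scoring via seed-word co-occurrence.
# This is an illustration of the intuition, not the proposed optimization method.
from collections import Counter

POS_SEEDS = {"good", "great", "love"}
NEG_SEEDS = {"bad", "terrible", "hate"}

posts = [
    "love this movie great acting",
    "this movie is dope great soundtrack",
    "terrible movie bad plot total flop",
    "what a flop hate this movie",
]

def polarity_scores(posts, target="movie"):
    pos, neg = Counter(), Counter()
    for post in posts:
        words = set(post.split())
        if target not in words:
            continue  # only posts mentioning the target contribute
        for w in words - {target}:
            pos[w] += len(words & POS_SEEDS)
            neg[w] += len(words & NEG_SEEDS)
    # Score in [-1, 1]: positive minus negative co-occurrence, normalized.
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w])
            for w in set(pos) | set(neg) if pos[w] + neg[w] > 0}

scores = polarity_scores(posts)
print(scores["dope"] > 0, scores["flop"] < 0)  # slang acquires target-dependent polarity
```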
Vahid Taslimitehrani's Dissertation Defense: Friday, February 19, 2015.
Ph.D. Committee: Drs. Guozhu Dong (Advisor), T.K. Prasad, Amit Sheth, Keke Chen, and Jyotishman Pathak (Division of Health Informatics, Weill Cornell Medical College, Cornell University).
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose novel types of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC), respectively. Both PXR and PXC rely on identifying regions of the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of regression and classification techniques, called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC), to build accurate and interpretable PXR and PXC models.
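A minimal sketch of the pattern-aided idea (not the authors' CPXR algorithm): fit one baseline model, describe the region where it has large errors with a simple pattern, and fit local models per region. The synthetic two-regime data and the single-threshold "pattern" are invented assumptions; CPXR mines contrast patterns over many attributes instead of using one hand-picked condition.

```python
# Sketch of pattern-aided regression on heterogeneous data: a global line
# fits poorly when the data has two regimes, while a pattern + local models do better.
import random

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(400)]
# Two regimes: y = 0.5x for x <= 5, y = 3x - 10 for x > 5, plus small noise.
ys = [(3 * x - 10 if x > 5 else 0.5 * x) + random.gauss(0, 0.1) for x in xs]

def fit_line(px, py):
    """Ordinary least squares for y = a*x + b."""
    mx, my = sum(px) / len(px), sum(py) / len(py)
    a = sum((x - mx) * (y - my) for x, y in zip(px, py)) / \
        sum((x - mx) ** 2 for x in px)
    return a, my - a * mx

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

# Baseline: one global line over all data.
a, b = fit_line(xs, ys)
baseline = [a * x + b for x in xs]

# "Pattern": a single condition (x > 5) covering the baseline's high-error
# region; CPXR would mine such contrast patterns automatically.
inside = [(x, y) for x, y in zip(xs, ys) if x > 5]
outside = [(x, y) for x, y in zip(xs, ys) if x <= 5]
ai, bi = fit_line([p[0] for p in inside], [p[1] for p in inside])
ao, bo = fit_line([p[0] for p in outside], [p[1] for p in outside])
pxr = [ai * x + bi if x > 5 else ao * x + bo for x in xs]

print(rmse(pxr, ys) < rmse(baseline, ys))  # True: local models fit each regime
```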
We have conducted comprehensive studies to evaluate the performance of CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high-dimensional datasets. Besides being new types of models, PXR and PXC can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to classify imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we apply a weighting method to boost minority instances, as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in these three applications.
Cory Henson defended his thesis on "A Semantics-based Approach to Machine Perception".
Video can be found at: http://www.youtube.com/watch?v=L8M7eoGKtSE
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, influencing innovations in diagnosis, treatment, prevention, and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge, and heuristics to complement distributional and graph-theoretic approaches has serious limitations. ...
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
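The core LBD operation of finding hidden connections can be sketched as Swanson-style A-B-C discovery over semantic predications: find an intermediate concept B linking A and C that never co-occur directly. The predications below are a textbook illustration (Swanson's fish oil / Raynaud's hypothesis), not output of the dissertation's system, which operates over SemMedDB-scale graphs with context-driven subgraph creation.

```python
# Sketch of A-B-C literature-based discovery over semantic predications.
# Predications are illustrative triples, not real SemMedDB data.
from collections import defaultdict

predications = [
    ("fish_oil", "REDUCES", "blood_viscosity"),
    ("blood_viscosity", "ASSOCIATED_WITH", "raynauds_disease"),
    ("fish_oil", "AFFECTS", "platelet_aggregation"),
]

# Adjacency: subject -> {(relation, object), ...}
graph = defaultdict(set)
for subj, rel, obj in predications:
    graph[subj].add((rel, obj))

def bridges(a, c, graph):
    """Intermediate concepts B such that A -> B and B -> C are both asserted."""
    return [(rel1, b, rel2)
            for rel1, b in graph[a]
            for rel2, obj in graph[b] if obj == c]

found = bridges("fish_oil", "raynauds_disease", graph)
print(found)  # [('REDUCES', 'blood_viscosity', 'ASSOCIATED_WITH')]
```

Real LBD systems additionally score and filter such bridges; the dissertation's contribution is to use implicit and explicit context, rather than distributional statistics alone, to decide which bridges are meaningful.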
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer, Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM), and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate = 19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
Description - Ajith defended his thesis on application and data portability in cloud computing. More details on Ajith's research and publications can be found at http://knoesis.wright.edu/researchers/ajith/
Video can be found at : http://www.youtube.com/watch?v=oDBeBIIFmHc&list=UUORqXk1ZV44MOwpCorAROyQ&index=1&feature=plpp_video
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize much of the benefits promised. If this limitation is left unaddressed, then the LOD Cloud will merely be more data that suffers from the same kinds of problems, which plague the Web of Documents, and hence the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to the problem of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between the classes and properties of ontologies. We identify subsumption, equivalence, and part-of relationships between classes; part-of relationships between instances; and subsumption and equivalence relationships between properties. By bootstrapping we mean the process of utilizing the information contained within the datasets to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence of the feasibility and applicability of the solution.
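A minimal sketch of the simplest layer of ontology alignment, matching on class-name tokens. This is only the name-based baseline; BLOOMS itself bootstraps from Wikipedia category hierarchies to decide subsumption and equivalence. The class names, threshold, and relation labels below are invented for illustration.

```python
# Sketch: name-based ontology alignment via token overlap (Jaccard similarity).
# A hypothetical baseline, not the BLOOMS algorithm itself.

def tokens(name):
    """Split a CamelCase class name into a set of lowercase tokens."""
    out, cur = [], ""
    for ch in name:
        if ch.isupper() and cur:
            out.append(cur.lower())
            cur = ch
        else:
            cur += ch
    out.append(cur.lower())
    return set(out)

def align(classes_a, classes_b, threshold=0.5):
    """Propose correspondences whose token Jaccard similarity meets a threshold."""
    matches = []
    for a in classes_a:
        for b in classes_b:
            ta, tb = tokens(a), tokens(b)
            jaccard = len(ta & tb) / len(ta | tb)
            if jaccard >= threshold:
                # Identical token sets hint at equivalence; overlap at relatedness.
                rel = "equivalent" if ta == tb else "related"
                matches.append((a, b, rel))
    return matches

pairs = align(["SoccerPlayer", "Stadium"], ["Player", "FootballStadium"])
print(pairs)
```

Name matching alone cannot distinguish subsumption from mere relatedness, which is exactly the gap the bootstrapped Wikipedia evidence is meant to fill.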
Krishnaprasad Thirunarayan, Trust Management: Multimodal Data Perspective, Invited Tutorial, The 2015 International Conference on Collaboration Technologies and Systems (CTS 2015), June 2015
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Amit Sheth
Abstract
Kno.e.sis (http://knoesis.org) is a world-class research center that uses semantic, cognitive, and perceptual computing for gathering insights from physical/IoT, cyber/Web, and social and enterprise (e.g., clinical) big data. We innovate and employ semantic web, machine learning, NLP/IR, data mining, network science, and highly scalable computing techniques. Our highly interdisciplinary research impacts health and clinical applications, biomedical and translational research, epidemiology, cognitive science, social good, policy, development, etc. The majority of our $12+ million in active funding comes from the NSF and NIH. In this talk, I will provide an overview of some of our major research projects.
Kno.e.sis is highly successful in its primary mission of exceptional student outcomes: our students have exceptional publication and real-world impact and our PhDs compete with their counterparts from top 10 schools for initial jobs in research universities, top industry research labs, and highly competitive companies. A key reason for Kno.e.sis' success is its unique work culture involving teamwork to solve complex problems. Practically all our work involves real-world challenges, real-world data, interdisciplinary collaborators, path-breaking research to solve challenges, real-world deployments, real-world use, and measurable real-world impact.
In this talk, I will also seek to discuss our choice of research topics and our unique ecosystem that prepares our students for exceptional careers.
This tutorial presents tools and techniques for effectively utilizing the Internet of Things (IoT) for building advanced applications, including Physical-Cyber-Social (PCS) systems. The issues and challenges related to IoT, semantic data modelling, annotation, knowledge representation (e.g., modelling for constrained environments, complexity issues, and time/location dependency of data), integration, analysis, and reasoning will be discussed. The tutorial will describe recent developments in creating annotation models and semantic description frameworks for IoT data (e.g., the W3C Semantic Sensor Network ontology). A review of enabling technologies and common scenarios for IoT applications from the data and knowledge engineering point of view will be discussed. Information processing, reasoning, and knowledge extraction, along with existing solutions related to these topics, will be presented. The tutorial summarizes state-of-the-art research and developments on PCS systems, IoT-related ontology development, linked data, domain knowledge integration and management, querying large-scale IoT data, and AI applications for automated knowledge extraction from real-world data.
Related: Semantic Sensor Web: http://knoesis.org/projects/ssw
Physical-Cyber-Social Computing: http://wiki.knoesis.org/index.php/PCS
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Amit Sheth
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or the Internet of Things around, on, and inside humans), public health signals (e.g., information coming from the healthcare system, such as hospital admissions), and population health signals (e.g., tweets by people related to asthma occurrences and allergens, and Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, then for all the data relevant to my child with the four V-challenges, what I care about is simply, "How is her current health, and what is the risk of an asthma attack in her current situation (now and today), especially if that risk has changed?" As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques, similar to the close interworking of the top brain and the bottom brain in cognitive models.
For harnessing Volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement, represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
Slideshare lost the previous upload which had nearly 70K views. Re-uploading. http://knoesis.org/?q=node/2633
With the explosion in social media (1B+ Facebook users, 500M+ Twitter users) and ubiquitous mobile access (6B+ mobile phone subscribers) sharing observations and opinions, we have unprecedented opportunities to extract social signals, create spatio-temporal mappings, perform analytics on social data, and support applications ranging from situational awareness during crisis response, preparedness, and rebuilding phases to advanced analytics on social data, gaining valuable insights to support improved decision making. This tutorial weaves together three themes and corresponding relevant topics: (a) citizen sensing and crisis mapping; (b) technical challenges and recent research on leveraging citizen sensing to improve crisis response coordination; and (c) experiences in building robust and scalable platforms/systems. It couples technical insights with identification of computational techniques and algorithms, along with real-world examples. We will also give exemplary demos of the features in the Sahana, CrowdMap (Ushahidi's version), and Twitris platforms while elaborating on the practical issues and pitfalls of the development and operation of these large-scale platforms, especially during real-time crisis response.
Citizen Sensor Data Mining, Social Media Analytics and Applications
Amit Sheth
Opening talk at Singapore Symposium on Sentiment Analysis (S3A), February 6, 2015, Singapore. http://s3a.sentic.net/#s3a2015
Abstract
With the rapid rise in the popularity of social media and near-ubiquitous mobile access, the sharing of observations and opinions has become commonplace. This has given us unprecedented access to the pulse of a populace and the ability to perform analytics on social data to support a variety of socially intelligent applications -- be it brand tracking and management, crisis coordination, organizing revolutions, or promoting social development in underdeveloped and developing countries.
I will review: 1) understanding and analysis of informal text, esp. microblogs (e.g., issues of cultural entity extraction and role of semantic/background knowledge enhanced techniques), and 2) how we built Twitris, a comprehensive social media analytics (social intelligence) platform.
I will describe the analysis capabilities along three dimensions: spatio-temporal-thematic, people-content-network, and sentiment-emotion-intent. I will couple technical insights with identification of computational techniques and real-world examples using live demos of Twitris (http://twitris2.knoesis.org).
Personalized and Adaptive Semantic Information Filtering for Social Media - Pavan Kapanipathi's Defense
1. Personalized and Adaptive Semantic Information Filtering for Social Media
Pavan Kapanipathi, PhD Candidate
Kno.e.sis Center, Wright State University
Committee: Drs. Amit Sheth (Advisor), Krishnaprasad Thirunarayan, Derek Doran, and Prateek Jain
Ohio Center of Excellence in Knowledge-Enabled Computing
4–6. Information Consumption on Social Media
• Updates of friends and acquaintances
• News [1]: 86% of Twitter users surveyed
• Medical information [2]: 1 in 3 use social media
• Disaster management [3]: 20 million tweets on Hurricane Sandy; most crisis management agencies monitor social media
Introduction
7. Information Overload on Social Media
• Users often complain of getting overwhelmed with the information on social media
• 5 billion posts per day
– Real-time information
• 1000+ in my social network
“...a wealth of information creates a poverty of attention...” – Herbert A. Simon
Introduction
8. Need for Information Filtering
• Scenario
– Address information overload
– Enormous data stream has to be filtered
• Information Filtering Systems
– Emails, News, and Blogs
– Functionality
• Understand user interests
• Deliver relevant information
Introduction
10. Traditional Information Filtering
[Diagram: user-generated content feeds a user interest identification / user modeling module (e.g., interests: NBA, Basketball, Sports); the filtering module scores streaming data against the user model (e.g., Relevance: 0.9) and outputs filtered data.]
Hanani, Uri, Bracha Shapira, and Peretz Shoval. "Information filtering: Overview of issues, research and systems." User Modeling and User-Adapted Interaction 11.3 (2001): 203-259.
Introduction
11. Challenges
1. Lack of Context
• Lack of context for processing short-text
– Short-Text
• The average length of social media posts (Facebook, Twitter, Google+, etc.) is 100-160 characters
• Identifying topics from short-text is important
– We can infer the author’s interest and deliver the tweet to users interested in the topic
– Traditional techniques have been shown to not perform well on social media [Sriram 2010, Derczynski 2013]
Example tweet: “Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game.”
Introduction
12. Challenges
2. Continuously Changing Vocabulary
• Social media is a real-time platform with information about the latest activities in the real world
• Hurricane Sandy
– Mitigation, preparedness, recovery, and response phases
– #Frankenstorm and #Sandy at the start, #StaySafe and #RedCross during the disaster, and #ThanksSandy and #RestoreTheShore after the hurricane
• Indian Elections
– The announcement of prime ministerial candidates, issues regarding corruption, and polls in different states
– #modikisarkar, #NaMo, #VoteForRG, and #CongBJPQuitIndia
[Images: Civil Unrest, Election, Natural Disaster]
Introduction
13. Challenges
3. Scalability
• Practical aspects of the filtering system
• Popularity of social media is increasing
– Facebook has more than 1 billion users
– Twitter has more than 500 million users
• Disseminating information to a huge set of users
– Centralized dissemination systems overload either the client or the server (Push or Pull model)
Introduction
14. Knowledge Bases
• A common theme across the methodologies developed is the use of background knowledge and Semantic Web technologies.
• Processing short-text leverages background knowledge from knowledge bases
“If a program is to perform a complex task well, it must know a great deal about the world in which it operates.” – Lenat & Feigenbaum
[Example: the tweet “Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game.” links to knowledge-base concepts such as Jason Heyward, Kris Bryant, Chicago Cubs, Baseball, and Sports]
Introduction
15. Wikipedia as a Knowledge Base
• Requirements for a knowledge base to be used for filtering social data
– Diversity and Comprehensiveness: large set of diverse users on social media such as Twitter and Facebook
– Real-time updates: social media is a real-time platform that discusses dynamic topics
• Wikipedia as the knowledge base
– Semi-structured – extract the structure
– Diverse: collaborative effort of 80,000 users with 5 million articles
– Near real-time updates with unbiased views on topics [Ferron 2011]
Introduction
16. Thesis Statement
To build an effective information filtering system, background knowledge and Semantic Web technologies can be used to address the lack of context, dynamically changing vocabulary, and scalability challenges introduced by social media’s short-text and real-time nature.
Introduction
17. Outline
• Short-Text: Lack of context for processing
– Hierarchical Interest Graphs
– Built a hierarchical context for tweets leveraging the Wikipedia category structure. This hierarchical context is utilized for user modeling and recommendations.
– Publications [ESWC 2014, WWWCOMP 2014, TR-JRNL 2016]
• Real-time and Dynamic Nature: Continuously Changing Vocabulary
– A novel methodology that utilizes the evolving Wikipedia hyperlink structure to detect topic-relevant hashtags for continuous filtering
– Publications [TR-CNF 2016, ESWC 2015]
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies
– Publications [ISWC 2011, SPIM 2011, ISWCDEM 2011]
Introduction
19. Processing Short-Text for User Interest Identification
• User-generated content is processed to understand user interests and to filter
– Tweets are used for these experiments
• The Wikipedia category structure comprises taxonomical information that can be leveraged
– Build context for short text for user interest identification
[Example: the tweet “Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game.” maps to the interest Baseball]
“You are what you share” – Charles W. Leadbeater
Lack of context
ESWC 2014
20. Content-Based User Interest Identification from Social Data
[Diagram: example tweets are processed by three families of techniques, ordered by increasing semantics, from term-frequency-based to knowledge-enabled approaches]
• Term Frequency Based Techniques [Ramage 2010] – term/freq: great 1, day 1, sports 2, cubs 2, ...
• Lower-Dimensional Space as Latent Semantics [Yan 2012] – dim/dist: 1dim 0.3, 2dim 0.2, 3dim 0.2, 4dim 0.1, 5dim 0.4
• Entity Based Techniques [Tao 2012] – Wiki-entities/freq: Chicago Cubs 2, Cinci Reds 2, White Sox 1, NY Yankees 1, ...
Example tweets: “Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game.” / “Not sure who the Reds will look too replace Dusty.some very interesting jobs open (Cubs, Mariners, Reds, poss Yanks) Girardi the domino”
Lack of context
ESWC 2014
21. Implicit Information from Social Data
[Diagram: entities mentioned in the example tweets (Chicago Cubs, Cincinnati Reds, White Sox, Seattle Mariners, San Francisco Giants, Oakland Athletics) connect upward to broader related interests: Major League Baseball Teams, Major League Baseball, Baseball, Baseball Organizations]
Lack of context
ESWC 2014
22. Methodology: Structured Hierarchical Knowledge
[Diagram: the Wikipedia category structure is transformed into a Wikipedia hierarchy; entities spotted in the example tweets (Chicago Cubs 0.6, Cincinnati Reds 1.0, White Sox 0.3, Seattle Mariners 0.3) link to categories such as Major League Baseball Teams, Major League Baseball, and Baseball, yielding broader related interests from the Wikipedia category structure]
Lack of context
ESWC 2014
23. Methodology: Scoring the Inferred Hierarchical Knowledge
[Diagram: spreading activation propagates the entity scores (Chicago Cubs 0.6, Cincinnati Reds 1.0, White Sox 0.3, Seattle Mariners 0.3) up the hierarchy, scoring the inferred categories, e.g., Major League Baseball Teams 0.5, Major League Baseball 0.4, Baseball 0.1]
Lack of context
ESWC 2014
24. Designing an Activation Function
• Design parameters to adapt to the structure of the Wikipedia hierarchy
– Uneven distribution of nodes in the hierarchy
• 16 hierarchical levels – most categories lie between hierarchical levels 5 and 9
• Raw Normalization: F_i = 1 / nodes(i+1)
• Log Normalization: FL_i = 1 / log10(nodes(i+1))
– Many-to-many category-subcategory relationships
• e.g., Boston Red Sox – Major League Baseball Teams, 1901 Establishments in Massachusetts
• Preferential Path Constraint: P_ji = 1 / priority_ji
– Boosting common ancestors
• The more entities that activate a concept, the greater its importance
• Intersect Booster: B_i = N_e(i) / N_e(c_max), the number of entities activating category i, normalized by the maximum over all categories
Lack of context
ESWC 2014
25. Activation Functions
• Bell (Raw Normalization): A_j = Σ_{i=0}^{n} A_i × F_j
• Bell Log (Log Normalization): A_j = Σ_{i=0}^{n} A_i × FL_j
• Priority Intersect (Log Normalization, Preferential Path, Intersect Booster): A_j = Σ_{i=0}^{n} A_i × FL_j × P_ji × B_j
where i is the child node, j is the category, and A_i is the activated value of i.
Lack of context
ESWC 2014
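The three activation functions above can be sketched in code. This is an illustrative reimplementation, not the thesis implementation: `level_sizes`, `priorities`, and the entity counts are assumed inputs derived from the Wikipedia hierarchy.

```python
import math

# Illustrative sketch of the three activation functions.
# children    : activation values A_i of the child nodes feeding category j
# level_sizes : level_sizes[l] = number of nodes at hierarchy level l (assumed input)
# level       : hierarchical level of the children

def bell(children, level_sizes, level):
    # Bell: A_j = sum_i A_i * F, with raw normalization F = 1 / nodes(level+1)
    f = 1.0 / level_sizes[level + 1]
    return sum(a * f for a in children)

def bell_log(children, level_sizes, level):
    # Bell Log: A_j = sum_i A_i * FL, with FL = 1 / log10(nodes(level+1))
    fl = 1.0 / math.log10(level_sizes[level + 1])
    return sum(a * fl for a in children)

def priority_intersect(children, level_sizes, level, priorities,
                       n_entities, n_entities_max):
    # Priority Intersect: A_j = sum_i A_i * FL * P_ji * B_j, where
    # P_ji = 1 / priority of category j among the categories of child i, and
    # B_j = n_entities / n_entities_max boosts categories reached by many entities.
    fl = 1.0 / math.log10(level_sizes[level + 1])
    booster = n_entities / n_entities_max
    return booster * sum(a * fl / p for a, p in zip(children, priorities))
```

For example, with two children activated at 0.6 and 1.0 feeding a category whose next level holds 10 nodes, `bell` yields (0.6 + 1.0) × 0.1 = 0.16.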
26. Hierarchical Interest Graph
[Diagram: the activation functions (Bell, Bell Log, Priority Intersect) propagate the entity scores from the example tweets (Chicago Cubs 0.6, Cincinnati Reds 1.0, White Sox 0.3, Seattle Mariners 0.3) through the hierarchy (Major League Baseball Teams 0.5, Major League Baseball 0.4, Baseball 0.1), yielding the Hierarchical Interest Graph]
Lack of context
ESWC 2014
27. Hierarchical Interest Graph Evaluation – User Study
Users | Tweets | Entities | Distinct Entities | Categories in HIG
37 | 31,927 | 29,146 | 13,150 | 111,535
[Chart: distribution of tweets across users]
Lack of context
ESWC 2014
28. Evaluation Results of Hierarchical Interests
Graded Precision (Bell / Bell Log / Priority Intersect):
k | Relevant | Irrelevant | Maybe
10 | 0.53 / 0.67 / 0.76 | 0.34 / 0.23 / 0.16 | 0.13 / 0.10 / 0.08
20 | 0.54 / 0.66 / 0.72 | 0.34 / 0.22 / 0.19 | 0.12 / 0.12 / 0.09
30 | 0.53 / 0.64 / 0.69 | 0.34 / 0.24 / 0.21 | 0.13 / 0.12 / 0.10
40 | 0.52 / 0.61 / 0.68 | 0.35 / 0.26 / 0.22 | 0.13 / 0.13 / 0.10
50 | 0.52 / 0.61 / 0.67 | 0.36 / 0.28 / 0.24 | 0.12 / 0.11 / 0.09
Mean Average Precision:
k | Bell | Bell Log | Priority Intersect
10 | 0.64 | 0.72 | 0.88
20 | 0.61 | 0.70 | 0.82
30 | 0.59 | 0.69 | 0.79
40 | 0.58 | 0.68 | 0.77
50 | 0.57 | 0.67 | 0.75
(Bold numbers in the original slide marked better performance; Priority Intersect performs best throughout.)
Lack of context
ESWC 2014
29. Implicit Interests Evaluation
• Implicit interests are categories of interest that were not explicitly mentioned in tweets but inferred from the knowledge base
[Example for the category Major League Baseball – Explicit: “On this day in 1934, Major League Baseball announced it would host its first night games”; Implicit: “Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber’s perfect game, Bulls win and Hawks stay alive”]
Lack of context
ESWC 2014
30. Summary: Hierarchical Interest Graphs
• Addressed the “Lack of Context” challenge in tweets using a hierarchical knowledge base
– More than 70% of hierarchical interests are implicit
• A new way to represent Twitter user interests
– Hierarchical Interest Graph with interest scores at each node
– Activation functions (models) to determine interest scores
What’s the use?
Lack of context
ESWC 2014
32. Content-Based Tweet Recommendation Approaches
• Term Frequency based approaches
– User profiles: built by scoring important terms
• TF, TF-IDF
• Entity Frequency [Tao 2012]
– User profiles: built by scoring important entities
• Wikipedia entities, extracted using Zemanta
• Support Vector Machines (SVMrank) [Duan 2010]
– User models built using content and tweet-based features
– Tweet content features: similarity to the user’s tweets, similarity of hashtags, tweet length, mention of URLs, mention of hashtags
• Latent Dirichlet Allocation [Ramage 2010]
– User profiles: distribution over 5 latent topics
Lack of context
TR-JRNL 2016
33. Experimental Setup
Users | Tweets | Entities
37 | 31,927 | 29,146
• Utilized the same dataset from the user study
• Training and testing datasets using two assumptions
– Tweets that users share are interesting to them and can be recommended (UGC Assumption)
• 80% to create user profiles
• 20% (~6,000) to test recommendation
– Retweets of users are interesting to them and can be recommended (Retweet Assumption, more popular in the literature)
• 30% (~9,000) were retweets, hence used to test recommendation
• 70% to create user profiles
Lack of context
TR-JRNL 2016
34. Evaluation Methodology
• Transformed to a top-N recommendation evaluation
– Popular top-N evaluation methodology by Cremonesi et al. [Cremonesi 2010] for Precision/Recall
• Methodology
– For every test tweet, pick 1,000 random tweets not tweeted/retweeted by the author of the test tweet
• Random tweets are considered irrelevant to the user
– Score and rank the test tweet together with the 1,000 random tweets using the recommendation algorithm
• TF, TF-IDF, Entity-based, SVMrank, LDA, and HIG
– If the test tweet is within the top-N, it is considered a hit (T is the total number of test tweets)
recall = hits / T
Lack of context
TR-JRNL 2016
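The evaluation methodology above can be sketched as follows. The function names are illustrative, and `score` stands in for any of the compared recommenders (TF, TF-IDF, Entity-based, SVMrank, LDA, HIG).

```python
import random

# Sketch of the Cremonesi-style top-N evaluation described on the slide.
# score(profile, tweet) is any recommender's scoring function.

def top_n_recall(test_tweets, pool, score, profile, n=10, sample=1000):
    hits = 0
    for tweet in test_tweets:
        # Rank the held-out tweet against `sample` random tweets,
        # which are assumed irrelevant to the user.
        candidates = random.sample(pool, sample) + [tweet]
        ranked = sorted(candidates, key=lambda t: score(profile, t), reverse=True)
        if tweet in ranked[:n]:
            hits += 1                  # the test tweet reached the top-N
    return hits / len(test_tweets)     # recall = hits / T
```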
35. Retweet Assumption Evaluation Results
• Term frequency performs the best for recommending retweeted tweets [Ramage et al. 2010]
Lack of context
TR-JRNL 2016
36. UGC Assumption Evaluation Results
• HIG performed better for most top-N, but at top-20 TF-based approaches performed better
Lack of context
TR-JRNL 2016
37. Content + Knowledge Based Approach
• TF performed the best among the content-based approaches
• Merged TF and HIG, which augments content with knowledge bases, and recommend using Pearson correlation
[Example: a TF profile (world: 3, great: 10, cricket: 24, slim: 13, good: 40, united: 34, states: 30) and a HIG profile (World Wide Web: 0.4, Technology: 0.007, Sports: 0.06, Baseball: 0.34, India: 0.102, United States: 0.2, Semantic Web: 0.2) are each normalized (good: 1, united: 0.85, states: 0.75, cricket: 0.6, slim: 0.325, great: 0.25, world: 0.075; World Wide Web: 1, Baseball: 0.85, United States: 0.5, Semantic Web: 0.5, India: 0.25, Sports: 0.15, Technology: 0.017) and then merged into a single profile]
Lack of context
TR-JRNL 2016
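A minimal sketch of the merge step, assuming both profiles are simple {feature: weight} dictionaries (the actual profile representation in the thesis may differ): each profile is max-normalized so term weights and category scores become comparable, then merged, and candidates are ranked by Pearson correlation.

```python
import math

def max_normalize(profile):
    # Divide every weight by the profile's maximum, e.g. good: 40 -> 1.0.
    m = max(profile.values())
    return {k: v / m for k, v in profile.items()}

def merge_profiles(tf_profile, hig_profile):
    # Merge the normalized TF terms and HIG categories into one profile.
    merged = max_normalize(tf_profile)
    merged.update(max_normalize(hig_profile))
    return merged

def pearson_score(profile, tweet_vector):
    # Align both vectors on the union of features; missing features count as 0.
    keys = sorted(set(profile) | set(tweet_vector))
    x = [profile.get(k, 0.0) for k in keys]
    y = [tweet_vector.get(k, 0.0) for k in keys]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```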
39. UGC Assumption Evaluation Results
• TF + HIG performs the best and provides an improvement of more than 20% at top-20
Lack of context
TR-JRNL 2016
40. Summary: Hierarchical Interest Graphs
• A new way to represent Twitter user interests
– Hierarchical Interest Graphs
• Addressed the “Lack of Context” challenge in tweets using a hierarchical knowledge base
• HIG (knowledge base) augments content to provide superior performance for tweet recommendation
Lack of context
TR-JRNL 2016
42. Outline
• Short-Text: Lack of context for processing
– Augmented content with hierarchical knowledge from Wikipedia
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved tweet recommendation by more than 40%
• Real-time and Dynamic Nature: Continuously Changing Vocabulary
– A novel methodology that utilizes the evolving Wikipedia hyperlink structure to update filters for streaming topic-relevant information
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies
Dynamic vocabulary
43. Social Media: A Real-Time and Dynamic Platform
• Dynamic topics of interest that continuously evolve over time
– Indian Elections
• The announcement of prime ministerial candidates, issues regarding corruption, and polls in different states
– Hurricane Sandy
• Mitigation, preparedness, recovery, and response phases
[Images: Indian Election, Hurricane Sandy]
Dynamic vocabulary
TR-CNF 2016
44. Filtering Dynamic Topics on Social Media
• Keyword-based filtering
– Twitter streaming API
• Keywords change dynamically based on happenings in the real world
– Necessary to track these keywords to stay up-to-date on the topic of interest
[Example: #indianelection expands to #modikisarkar, #NaMo, #VoteForRG, and #CongBJPQuitIndia; #sandy expands to #Frankenstorm, #Sandy, #RedCross, #RestoreTheShore]
Dynamic vocabulary
TR-CNF 2016
45. Hindsight Analysis of Topic-Relevant Hashtags
• Topic-relevant hashtags that can be used to crawl all the tweets co-occur with each other
• Analysis with over 6 million tweets: (1) Colorado Shooting, (2) Occupy Wall Street
• <1% of the topic-relevant hashtags can crawl up to 85% of the tweets
Dynamic vocabulary
TR-CNF 2016
46. Approach for Detecting Topic-Relevant Hashtags
[Pipeline: a manually started filter (e.g., #indianelection2014) streams tweets; hashtags co-occurring above a threshold δ (e.g., #modikisarkar) become candidates. In parallel, a dynamically updated background knowledge base is built from the topic’s Wikipedia page (e.g., Indian General Election, 2014): entities one hop from the topic page are scored on relevance to the event (Indian General Elec: 1.0, India: 0.9, Elections: 0.7, UPA: 0.6, BJP: 0.3, NDA: 0.3, Narendra Modi: 0.3), with the hyperlink structure extracted and periodically updated. For each candidate hashtag, entities are extracted from its latest K (200-500) tweets and scored by normalized frequency (e.g., Narendra Modi: 0.9, BJP: 0.7, NDA: 0.6, India: 0.4, Elections: 0.2, Rahul Gandhi: 0.2, Congress: 0.2); a similarity check against the event’s entity scores decides whether the hashtag is added to the filter.]
Dynamic vocabulary
TR-CNF 2016
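The pipeline above can be sketched roughly as two stages: candidate detection by co-occurrence, then a relevance check of the candidate's entity profile against the event's entities. The threshold handling and the exact asymmetric similarity measure here are simplified stand-ins for the thesis's versions.

```python
from collections import Counter

def candidate_hashtags(tweets, seeds, delta):
    # Stage 1: hashtags co-occurring with the seed filter above threshold delta.
    co, total = Counter(), 0
    for tags in tweets:               # each tweet represented as a set of hashtags
        if seeds & tags:
            total += 1
            co.update(tags - seeds)
    return {h for h, c in co.items() if c / total >= delta}

def relevance(hashtag_entities, topic_entities):
    # Stage 2: asymmetric similarity - how much of the hashtag's entity
    # profile (from its latest K tweets) is covered by the event's entities
    # (scored from the topic's Wikipedia page, one hop out).
    overlap = sum(min(s, topic_entities.get(e, 0.0))
                  for e, s in hashtag_entities.items())
    return overlap / sum(hashtag_entities.values())
```

A hashtag whose relevance exceeds a chosen cutoff would then be added to the streaming filter, which keeps the tracked vocabulary current as the event evolves.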
49. Dynamic Hashtag Filter
• Hashtag analysis
– The co-occurrence technique can be used to detect event-relevant hashtags
– More popular hashtags are easier to detect via co-occurrence
• Continuously changing vocabulary for dynamic topics and coverage
– Wikipedia as a dynamic knowledge base for events
– Determining relevant hashtags using an asymmetric similarity measure
– More hashtags in turn increase the coverage of tweets for events
• Content-based location prediction of Twitter users (ESWC 2015)
– A similar relevancy-detection framework was used for location prediction
Dynamic vocabulary
TR-CNF 2016
51. Outline
• Short-Text: Lack of context for processing
– Augmented content with hierarchical knowledge from Wikipedia
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved content-based tweet recommendation by more than 40%
• Real-time and Dynamic Nature: Continuously Changing Vocabulary
– Hindsight analysis insight: co-occurrence can be used as a starting point
– Utilized Wikipedia as an evolving knowledge base for dynamic topics
• The top-5 detected hashtags increased coverage by more than 3,500 tweets instantly, with a mean average precision of 0.92
• Popularity: Scalability
– Scalable distributed dissemination system that utilizes Semantic Web technologies
Scalability
52. Content Dissemination
• Centralized content dissemination suffers from scalability issues
– The server (publisher) or the client (subscriber) is overwhelmed
– The server for Push, the client for Pull
• Distributed dissemination protocol
– PubSubHubbub
• Introduced by Google in 2009
• 117 million users and 5.5 billion posts broadcast by 2011
Scalability
ISWC 2011
53. PubSubHubbub
• Simple, open, webhook-based pubsub protocol
• Extension to RSS, Atom
[Diagram: the publisher tells the hub “I have new content for feed X”; the hub replies “Give me the latest content for feed X”; the publisher answers “Here it is”; the hub then pushes “Here is the latest content for feed X” to all subscribers]
Scalability
ISWC 2011
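The hub-mediated flow can be illustrated with a toy in-process model. Real PubSubHubbub uses HTTP webhooks and feed URLs; this sketch (class and method names are illustrative) only mirrors the message sequence on the slide.

```python
# Toy model of the hub's role: the publisher pings the hub once, the hub
# fetches the latest content, and the hub fans it out to every subscriber.

class Hub:
    def __init__(self):
        self.subscriptions = {}   # feed_url -> list of subscriber callbacks

    def subscribe(self, feed_url, callback):
        self.subscriptions.setdefault(feed_url, []).append(callback)

    def publish(self, feed_url, fetch):
        # The publisher's load stays constant regardless of subscriber count:
        # the hub fetches the content once and handles the fan-out.
        content = fetch(feed_url)
        for deliver in self.subscriptions.get(feed_url, []):
            deliver(content)
```

This is why the protocol scales: the fan-out cost moves from the publisher to the (replicable) hub.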
54. PubSubHubbub Protocol Extension
[Diagram: the publisher tells the Semantic Hub “Hey, I have new content for feed topics/preference”; the Semantic Hub consults the social graph and user profiles to get the subscribers of the publisher whose profiles match the topic/preference; the hub asks the publisher “Give me the new content”, receives it (“Here it is”), and pushes “Here is the new content of feed X” only to the matching subscribers (Sub-A ... Sub-D)]
Scalability
ISWC 2011
55. Publisher – Social Data Annotation
• Preliminary processing of text for filtering
– Information extraction (entities, hashtags, URLs, etc.)
• Representing as RDF using the vocabulary used by SMOB
– Comprises SPARQL queries representing the subset of subscribers from the social graph in the hub

<http://twitter.com/rob/statuses/123456789>
  rdf:type sioct:MicroblogPost ;
  sioc:content "Great day for Chicago sports as well as Cubs beat the Reds, Sox beat the Mariners with Humber's perfect game #chicago" ;
  sioc:has_creator <http://example.com/rob> ;
  moat:taggedWith dbpedia:Chicago ;
  moat:taggedWith dbpedia:Chicago_Cubs ;
  moat:taggedWith dbpedia:Cincinnati_Reds ;
  sioc:topic <http://example.com/tags/chicago> .
Scalability
ISWC 2011
56. Semantic Hub
• Performs the matching of processed posts to user profiles
– Flexible to different matching techniques
• Pearson correlation or other similarity measures
• Delivers information to relevant subscribers

SELECT ?user WHERE {
  { ?user foaf:interest dbpedia:Chicago } UNION
  { ?user foaf:interest dbpedia:Chicago_Cubs } UNION
  { ?user foaf:interest dbpedia:Cincinnati_Reds }
}
Scalability
ISWC 2011
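One plausible way to assemble the subscriber-matching query above from a post's entity annotations is a small string-building helper. This is a hypothetical illustration, not the SMOB or Semantic Hub source.

```python
# Hypothetical helper: turn the entities annotated on an incoming post
# (the moat:taggedWith values) into the UNION-based SPARQL query.

def subscribers_query(entities):
    clauses = " UNION\n  ".join(
        "{ ?user foaf:interest dbpedia:%s }" % e for e in entities)
    return ("PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
            "PREFIX dbpedia: <http://dbpedia.org/resource/>\n"
            "SELECT ?user WHERE {\n  %s\n}" % clauses)

print(subscribers_query(["Chicago", "Chicago_Cubs", "Cincinnati_Reds"]))
```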
57. Semantic Hub: Conclusion
• Framework for distributed dissemination of content using PubSubHubbub
– The hub takes the load of the filtering module and the dissemination of content
• PubSubHubbub
– 117 million subscriptions by 2011
– 5.5 billion unique feeds by 2011
• Semantic Hub
– Privacy-aware dissemination for distributed social networks
– Real-time filtering
Scalability
ISWC 2011
58. Thesis Conclusion
• To build an effective information filtering system, background knowledge and Semantic Web technologies can be used to address the lack of context, dynamically changing vocabulary, and scalability challenges introduced by social media’s short-text and real-time nature.
– Augmented content with hierarchical knowledge from Wikipedia to improve context of short-text
• 70% of the top-50 interests were implicit (not mentioned in users’ tweets)
• Improved content-based tweet recommendation by more than 40%
– Utilized Wikipedia as an evolving knowledge base for dynamic topics to detect topic descriptors for filtering
• Hindsight analysis insight: co-occurrence can be used as a starting point
• The top-5 detected hashtags increased coverage by more than 3,500 tweets instantly, with a mean average precision of 0.92
– Extended PubSubHubbub, a distributed content dissemination protocol, with Semantic Web technologies for filtering and dissemination
Conclusion
59. Graduate Journey
• Hierarchical Interest Graphs
– Internship work – IBM T.J. Watson Research Center, 2013
• Location Prediction of Twitter Users
– Alleviates the dependence on training data
• Determining Twitter User Hobbies
– Internship work – Samsung Research America, 2014 (patent pending)
• Tweet Filtering and Recommendation
– Addressing the problem of dynamic topic drift
Conclusion
60. Graduate Journey
• Research Internships
– 2011 DERI, Ireland (ISWC 2011, SPIM 2011, WebSci 2011)
– 2013 IBM T.J. Watson Research Center (WWWCOMP 2014, ESWC 2014)
– 2014 Samsung Research America (patent pending)
• Invited Talks
– IBM T.J. Watson Research Center, Frontiers of Cloud Computing and Big Data Workshop
– EMC CTO Office, Bangalore, Invited Speaker Series
– WSU Advisory Board
• Proposals and Projects
– Twitris – NSF Commercialization
– Ohio State University – NSF Hazards SEES ($2M)
– CITAR (Epidemiology) – NIH EdrugTrends ($1.6M)
• Development of Research Systems
– Twarql – a semantic tweet filtering system
• Winner of the Triplification Challenge (ISem 2010)
– Scalable content dissemination on distributed social networks (ISWC 2011)
– Twitris – a social semantic web for analyzing events
[Collaboration logos, including CITAR]
Conclusion
61. Publications
• [NOISE 2015] Raghava Mutharaju and Pavan Kapanipathi. Are We Really Standing on the Shoulders of Giants? 1st Workshop on Negative or Inconclusive Results in Semantic Web at ESWC, 2015.
• [KNOW 2015] Siva Kumar Chekula, Pavan Kapanipathi, Derek Doran, Amit Sheth. Entity Recommendations Using Hierarchical Knowledge Bases. 4th International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, 2015.
• [ESWC 2015] Pavan Kapanipathi, Revathy Krishnamurthy (joint first author), Amit Sheth, Krishnaprasad Thirunarayan. Knowledge Enabled Approach to Predict the Location of Twitter Users. Extended Semantic Web Conference, 2015. (acceptance rate 23%)
• [ESWC 2014] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. User Interests Identification on Twitter Using a Hierarchical Knowledge Base. Extended Semantic Web Conference 2014, Crete, Greece. (acceptance rate 23%)
• [WWWCOMP 2014] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical Interest Graph from Twitter. 23rd International Conference on World Wide Web Companion 2014 (WWW Companion 2014), Seoul, South Korea.
• [WI 2013] Fabrizio Orlandi, Pavan Kapanipathi, Alexandre Passant, Amit Sheth. Characterising Concepts of Interest Leveraging Linked Data and the Social Web. 2013 IEEE/WIC/ACM International Conference on Web Intelligence, Atlanta, USA, 2013.
• [SPIM 2011] Pavan Kapanipathi, Fabrizio Orlandi, Amit Sheth, Alexandre Passant. Personalized Filtering of the Twitter Stream. 2nd Workshop on Semantic Personalized Information Management at ISWC 2011, September 2011.
• [ISWC 2011] Pavan Kapanipathi, Julia Anaya, Amit Sheth, Brett Slatkin, Alexandre Passant. Privacy-Aware and Scalable Content Dissemination in Distributed Social Networks. 10th International Semantic Web Conference 2011, Bonn, Germany, September 2011. (acceptance rate 22%)
Conclusion
62. Publications (continued)
• [ISWCDEM 2011] Pavan Kapanipathi, Julia Anaya, Alexandre Passant. SemPuSH: Privacy-Aware and Scalable Broadcasting for Semantic Microblogging. 10th International Semantic Web Conference 2011.
• [FSWE 2011] Pavan Kapanipathi. SMOB: The Best of Both Worlds. Federated Social Web Europe Conference, Berlin, June 3-5, 2011.
• [WEBSCI 2011] Alexandre Passant, Owen Sacco, Julia Anaya, Pavan Kapanipathi. Privacy-By-Design in Federated Social Web Applications. WebSci 2011, Koblenz, Germany, June 14-17, 2011.
• [ISEM 2010] Pablo Mendes, Pavan Kapanipathi, Alexandre Passant. Twarql: Tapping into the Wisdom of the Crowd. Triplification Challenge 2010 at the 6th International Conference on Semantic Systems (I-SEMANTICS).
• [WI 2010] Pablo Mendes, Alexandre Passant, Pavan Kapanipathi, Amit Sheth. Linked Open Social Signals. IEEE/WIC/ACM International Conference on Web Intelligence (WI-10), 2010.
• [WEBSCI 2010] Pablo Mendes, Pavan Kapanipathi, Delroy Cameron, Amit Sheth. Dynamic Associative Relationships on the Linked Open Data Web. In Proceedings of WebSci10: Extending the Frontiers of Society On-Line.
• [TR-CNF 2016] Pavan Kapanipathi, Krishnaprasad Thirunarayan, Fabrizio Orlandi, Amit Sheth, Pascal Hitzler. A Real-Time Approach for Continuous Crawling of Events on Twitter by Leveraging Wikipedia. Technical report.
• [TR-JRNL 2016] Pavan Kapanipathi, Siva Kumar, Derek Doran, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical Knowledge Base Enabled Twitter User Modeling and Recommendation. (Journal)
• [TR-CNFC 2016] Siva Kumar, Pavan Kapanipathi, Derek Doran, Prateek Jain, Amit Sheth. Exploring Taxonomical Interests for Entity Recommendations. Technical report, 2015.
• [TR-CNFC 2016] Sarasi Sarangi, Pavan Kapanipathi, Amit Sheth. Domain-specific Subgraph Generation. Technical report, 2015.
Conclusion
63. References
• [1] How Do People Use Social Media for Business/Finance News? http://blog.marketwired.com/2013/11/12/how-do-people-use-social-media-for-businessfinance-news/
• [2] What is the role of social media in healthcare? http://worldofdtcmarketing.com/role-social-media-healthcare/social-media-and-healthcare/
• [3] Social media use during disaster management. http://www.emergency-management-degree.org/crisis/
• [Tao 2012] Tao, K., Abel, F., Gao, Q., and Houben, G.-J. (2012). TUMS: Twitter-based User Modeling Service.
• [Ramage 2010] Ramage, D., Dumais, S., and Liebling, D. (2010). Characterizing microblogs with topic models. AAAI ’10.
• [Yan 2012] Yan, R., Lapata, M., and Li, X. (2012). Tweet recommendation with graph co-ranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.
• [Duan 2010] Duan, Y., Jiang, L., Qin, T., Zhou, M., and Shum, H.-Y. (2010). An empirical study on learning to rank of tweets. COLING ’10.
• [Cremonesi 2010] Cremonesi, P., Koren, Y., and Turrin, R. (2010). Performance of recommender algorithms on top-N recommendation tasks. RecSys 2010.
• [Sriram 2010] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and Demirbas, M. (2010). Short text classification in Twitter to improve information filtering. SIGIR ’10.
• [Derczynski 2013] Derczynski, L., Maynard, D., Aswani, N., and Bontcheva, K. (2013). Microblog-genre noise and impact on semantic annotation accuracy. HT ’13.
• [Ferron 2011] Ferron, M. and Massa, P. (2011). Collective memory building in Wikipedia: the case of North African uprisings. WikiSym 2011.