Dissertation Defense:
"Mining and Analyzing Subjective Experiences in User Generated Content"
By Lu Chen
Tuesday, April 9, 2016
Dissertation Committee: Dr. Amit Sheth (Advisor), Dr. T. K. Prasad, Dr. Keke Chen, Dr. Ingmar Weber, and Dr. Justin Martineau
Pictures: https://www.facebook.com/Kno.e.sis/photos/?tab=album&album_id=1225911137443732
Video: https://youtu.be/tzLEUB-hggQ
Lu's Home page: http://knoesis.wright.edu/researchers/luchen/
ABSTRACT
Web 2.0 and social media enable people to create, share and discover information instantly anywhere, anytime. A great amount of this information is subjective information -- the information about people's subjective experiences, ranging from feelings of what is happening in our daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies to support decision making in areas such as product purchase, marketing strategy, and policy making. However, because much useful subjective information is buried in ever-growing user-generated data on social media platforms, it is still difficult to extract high-quality subjective information and make full use of it with current technologies.
Current subjectivity and sentiment analysis research has largely focused on classifying the text polarity -- whether the expressed opinion regarding a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information such as emotion, intent, and preference, which may prevent their exploitation from reaching its full potential. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We have identified four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie, or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation makes contributions in developing novel and general techniques for the tasks of identifying and extracting these components.
We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus. Instead of associating the overall sentiment with a given text, this method assesses the more fine-grained target-dependent polarity of each sentiment expression. Unlike pattern-based approaches which often fail to capture the diversity of sentiment expressions due to the informal nature of language usage and writing style in social media posts, the proposed approach is capable of identifying sentiment phrase
Wenbo Wang defended his PhD dissertation on automatic emotion identification from text. His dissertation focused on three areas: 1) Emotion classification using machine learning techniques to identify emotions from suicide notes and tweets. 2) Creating large self-labeled emotion datasets by leveraging hashtags on Twitter. 3) Adapting emotion identification models to new domains by selecting informative tweets to add to limited labeled data in target domains like blogs. The goal was to improve identification by utilizing large Twitter datasets while addressing challenges of limited labeled data in other domains.
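The hashtag-based self-labeling idea mentioned above can be illustrated with a small sketch. This is not the dissertation's pipeline: the hashtag-to-emotion table, the function name `self_label`, and the rule of trusting only a trailing hashtag are assumptions made for illustration.

```python
# Hypothetical mapping from emotion hashtags to emotion labels (illustrative only).
EMOTION_HASHTAGS = {
    "#happy": "joy",
    "#joy": "joy",
    "#sad": "sadness",
    "#angry": "anger",
    "#scared": "fear",
}


def self_label(tweet):
    """Return (text, emotion) if the tweet ends with a known emotion hashtag.

    The labeling hashtag is stripped from the text so that a classifier
    trained on the result cannot simply read the label. Tweets without a
    trailing emotion hashtag yield None and are skipped.
    """
    tokens = tweet.strip().split()
    if not tokens:
        return None
    emotion = EMOTION_HASHTAGS.get(tokens[-1].lower())
    if emotion is None:
        return None
    return " ".join(tokens[:-1]), emotion


tweets = [
    "Finally got the job offer #happy",
    "Long day at work",
    "Missed the last train home #sad",
]
labeled = [pair for pair in map(self_label, tweets) if pair is not None]
```

Here `labeled` keeps only the two tweets that carry a trailing emotion hashtag, each paired with the emotion that hashtag maps to.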
Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth, Advisor; T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY
ABSTRACT:
Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive usage, flexibility of the language, creativity of the human beings, and social, cultural, and economic changes that have taken place in daily life have added new constructs, styles, and features to the language. One such feature of the language is its ability to express ideas, opinions, and facts in an implicit manner. This is a feature that is used extensively in day to day communications in situations such as: 1) expressing sarcasm, 2) when trying to recall forgotten things, 3) when required to convey descriptive information, 4) when emphasizing the features of an entity, and 5) when communicating a common understanding.
Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasteron was prescribed.' The tweet has an implicit mention of the entity Gravity and the clinical text snippet has implicit mention of the relationship between medication Dolasteron and clinical condition nausea. Such implicit references of the entities and the relationships are common occurrences in daily communication and they add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from Tweets.
This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution that is capable of extracting implicit factual information from text. The developed solution starts by acquiring relevant knowledge to solve the implicit information extraction problem. The relevant knowledge includes domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms such as a text snippet, structured knowledge represented in standard knowledge representation languages like Resource Description Framework (RDF) or custom formats. Hence, the acquired knowledge is processed to create models that can be understood by machines. Such models provide the infrastructure to perform implicit information extraction of interest.
This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in these use cases. They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.
There is a rapid intertwining of sensors and mobile devices into the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. Sensing and computational components embedded in the physical world are termed a Cyber-Physical System (CPS). The current science of CPS has yet to effectively integrate citizen observations in CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework to move from observations to decision-making and actions in PCS systems consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with uncertainty, complexity, and dynamism that helps translate observations into actions. Data-driven approaches alone are not guaranteed to synthesize PGMs that accurately reflect real-world dependencies. To overcome this limitation, we propose to empower PGMs using declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities, used in PCS event extraction, (b) Bayesian Network structure refinement using causal knowledge from ConceptNet, used in PCS event understanding, (c) knowledge-driven piecewise linear approximation of nonlinear time series dynamics using Linear Dynamical Systems (LDS), used in PCS event understanding, and (d) transformation of knowledge of goals and actions into a Markov Decision Process (MDP) model, used in PCS action recommendation.
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
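To make the action-recommendation step more concrete, the following sketch solves a toy MDP with standard value iteration. The two traffic states, the actions, and every probability and reward are hypothetical numbers chosen for illustration; the abstract does not specify how the dissertation builds or solves its MDP.

```python
# Toy MDP: P[state][action] is a list of (probability, next_state, reward).
# All states, actions, and numbers are hypothetical.
P = {
    "clear": {
        "stay": [(0.9, "clear", 1.0), (0.1, "congested", -1.0)],
        "reroute": [(1.0, "clear", 0.5)],
    },
    "congested": {
        "stay": [(0.8, "congested", -1.0), (0.2, "clear", 1.0)],
        "reroute": [(0.7, "clear", 0.5), (0.3, "congested", -1.0)],
    },
}


def q_value(P, V, state, action, gamma):
    """Expected discounted return of taking `action` in `state`."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[state][action])


def value_iteration(P, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Return (state values, greedy policy) for the MDP described by P."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {}
    for s in P:
        policy[s] = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
    return V, policy


values, policy = value_iteration(P)
```

With these made-up numbers the greedy policy recommends staying on a clear route and rerouting when the route is congested.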
Video: https://www.youtube.com/watch?v=ZCToaDgxnAs
Abstract:
People's emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. Such techniques have valuable implications for the studies of suicide prevention, employee productivity, well-being of people, customer relationship management, etc. However, emotion identification is quite challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes the emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, which makes it difficult to glean emotions purely by keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small and fail to provide comprehensive coverage of emotion-triggering events and situations.
Understanding users’ latent intents behind search queries is essential for satisfying a user’s search needs. Search intent mining can help search engines enhance their ranking of search results and enable new search features like instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, there has been increasing attention on studying how to effectively mine search intents by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of the queries (e.g. sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of impact on the user and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach for domain-specific search intent mining with a focus on health-related search queries.
First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies, analyses of popular health websites, literature surveys, and an empirical study of search queries. We defined the problem of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification tasks (namely, problem transformation and algorithm adaptation methods) were not feasible because of the difficulty of creating labeled data and because of health domain constraints. Another challenge in solving the search intent identification problem was mapping terms used by laymen to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging rich background knowledge encoded in the Unified Medical Language System (UMLS) and a crowd-sourced encyclopedia (Wikipedia). The approach can identify search intent in a disease-agnostic manner and has been evaluated on three major diseases.
While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, such as public social platforms like Twitter. Although Twitter is an excellent information source, identifying informative tweets in the deluge of tweets is a major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study to compare health search intents and the features that contribute to the expression of search intent, using 100+ million search queries from smart devices (smartphones/tablets) and personal computers (desktops/laptops).
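A rule-based, multi-label classifier of the kind described above can be sketched in a few lines. The three intent classes, their cue words, and the lay-to-medical term table below are hypothetical stand-ins: the dissertation uses 14 classes and draws its background knowledge from UMLS and Wikipedia, neither of which is reproduced here.

```python
# Hypothetical lay-term to medical-term table (a toy stand-in for UMLS lookups).
LAY_TO_MEDICAL = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
}

# Hypothetical cue words for three illustrative intent classes.
INTENT_RULES = {
    "symptoms": {"symptom", "symptoms", "signs"},
    "treatment": {"treatment", "cure", "therapy"},
    "medication": {"drug", "medication", "dosage"},
}


def normalize(query):
    """Lowercase the query and replace lay phrases with medical terms."""
    text = query.lower()
    for lay, medical in LAY_TO_MEDICAL.items():
        text = text.replace(lay, medical)
    return text


def classify(query):
    """Return the set of intent classes whose cue words occur in the query.

    The result may be empty or contain several classes, which is what makes
    the task multi-label rather than multi-class.
    """
    words = set(normalize(query).split())
    return {intent for intent, cues in INTENT_RULES.items() if words & cues}


intents = classify("heart attack symptoms and treatment")
```

Because the rules key on intent cue words rather than on a disease name, the same rule set applies to queries about any condition, which is one way to read "disease-agnostic".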
Video of the talk: https://www.youtube.com/watch?v=7k-u_TUew3o
Abstract: Social media has experienced immense growth in recent times. These platforms are becoming increasingly common for information seeking and consumption, and with their growing popularity, information overload poses a significant challenge to users. For instance, Twitter alone generates around 500 million tweets per day, and it is impractical for users to parse through such an enormous stream to find information that is interesting to them. This situation necessitates efficient personalized filtering mechanisms for users to consume relevant, interesting information from social media.
Building a personalized filtering system involves understanding users' interests and utilizing these interests to deliver relevant information to users. These tasks primarily include analyzing and processing social media text, which is challenging due to its shortness in length and the real-time nature of the medium. The challenges include: (1) Lack of semantic context: Social media posts are on average short in length, which provides limited semantic context to perform textual analysis. This is particularly detrimental for topic identification, which is a necessary task for mining users' interests; (2) Dynamically changing vocabulary: Most social media websites such as Twitter and Facebook generate posts that are of current (timely) interest to the users. Due to this real-time nature, information relevant to dynamic topics of interest evolves, reflecting the changes in the real world. This in turn changes the vocabulary associated with these dynamic topics of interest, making it harder to filter relevant information; (3) Scalability: The number of users on social media platforms is very large, which makes it difficult for centralized systems to scale and deliver relevant information to users. This dissertation is devoted to exploring semantic techniques and Semantic Web technologies to address the above-mentioned challenges in building a personalized information filtering system for social media. In particular, the necessary semantics (knowledge) is derived from crowd-sourced knowledge bases such as Wikipedia to improve context for understanding short text and dynamic topics on social media.
Vahid Taslimitehrani's Dissertation Defense: Friday, February 19 2015.
Ph.D. Committee: Drs. Guozhu Dong, Advisor, T.K. Prasad, Amit Sheth, Keke Chen
and Jyotishman Pathak, Division of Health Informatics, Weill Cornell Medical College, Cornell University.
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
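The structure of a pattern aided regression model can be illustrated with a toy one-dimensional sketch. It is not the CPXR algorithm: the pattern is written by hand instead of being mined, and the weighted-average rule for combining matching local models is an assumption made for this example. The names `fit_line` and `pxr_predict` are hypothetical.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least squares fit of y = a*x + b on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    b = my - a * mx
    return lambda x: a * x + b


def pxr_predict(x, baseline, pattern_models):
    """Predict with the local models whose pattern matches x.

    pattern_models is a list of (pattern, local_model, weight) triples.
    If no pattern matches, fall back to the baseline model.
    """
    matches = [(model, w) for pattern, model, w in pattern_models if pattern(x)]
    if not matches:
        return baseline(x)
    return sum(w * model(x) for model, w in matches) / sum(w for _, w in matches)


# Toy heterogeneous data: y rises for x < 5 and falls for x >= 5.
data = [(x, float(x)) for x in range(5)] + [(x, 20.0 - 2.0 * x) for x in range(5, 10)]

baseline = fit_line(data)  # a single global line fits neither regime well


def high_x(x):
    """Hand-written pattern marking a region where the baseline errs."""
    return x >= 5


local = fit_line([(x, y) for x, y in data if high_x(x)])
pattern_models = [(high_x, local, 1.0)]


def sse(predict):
    """Sum of squared errors of a predictor over the toy data."""
    return sum((predict(x) - y) ** 2 for x, y in data)


baseline_error = sse(baseline)
pxr_error = sse(lambda x: pxr_predict(x, baseline, pattern_models))
```

On this toy data `pxr_error` is lower than `baseline_error`, because the local model replaces the baseline exactly where the baseline's residuals are largest.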
We have conducted a set of comprehensive performance studies to evaluate the performance of CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of modeling, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to handle classifying imbalanced datasets and introduced a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we applied a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in those three applications.
1) The document discusses a semantics-based approach to machine perception that uses semantic web technologies to derive abstractions from sensor data using background knowledge on the web.
2) It addresses three primary issues: annotation of sensor data, developing a semantic sensor web, and enabling semantic perception intelligence at the edge on resource-constrained devices.
3) The approach represents background knowledge and sensor observations using ontologies, and uses deductive and abductive reasoning over these representations to interpret sensor data at multiple levels of abstraction.
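The abductive step in point 3 can be sketched as a simple covering check: an explanation is any known feature whose expected properties account for everything that was observed. The weather features and properties below are hypothetical, and a plain Python dictionary stands in for the ontologies that the approach actually uses.

```python
# Hypothetical background knowledge: feature -> properties it is expected to exhibit.
KNOWLEDGE = {
    "blizzard": {"low temperature", "snow", "high wind"},
    "flurry": {"low temperature", "snow"},
    "rain storm": {"rain", "high wind"},
}


def explain(observed, knowledge=KNOWLEDGE):
    """Return the features that can account for every observed property.

    A feature is a candidate explanation when the observed properties form
    a subset of the properties the feature is known to exhibit.
    """
    observed = set(observed)
    return sorted(name for name, props in knowledge.items() if observed <= props)


candidates = explain({"snow"})
narrowed = explain({"snow", "high wind"})
```

A single observation leaves several candidate explanations, and each additional observation narrows the set, which is how raw sensor readings are lifted to a higher-level abstraction.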
Wenbo Wang defended his PhD dissertation on automatic emotion identification from text. His dissertation focused on three areas: 1) Emotion classification using machine learning techniques to identify emotions from suicide notes and tweets. 2) Creating large self-labeled emotion datasets by leveraging hashtags on Twitter. 3) Adapting emotion identification models to new domains by selecting informative tweets to add to limited labeled data in target domains like blogs. The goal was to improve identification by utilizing large Twitter datasets while addressing challenges of limited labeled data in other domains.
Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth, Advisor; T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY
ABSTRACT:
Natural language is a powerful tool developed by humans over hundreds of thousands of years. The extensive usage, flexibility of the language, creativity of the human beings, and social, cultural, and economic changes that have taken place in daily life have added new constructs, styles, and features to the language. One such feature of the language is its ability to express ideas, opinions, and facts in an implicit manner. This is a feature that is used extensively in day to day communications in situations such as: 1) expressing sarcasm, 2) when trying to recall forgotten things, 3) when required to convey descriptive information, 4) when emphasizing the features of an entity, and 5) when communicating a common understanding.
Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasteron was prescribed.' The tweet has an implicit mention of the entity Gravity and the clinical text snippet has implicit mention of the relationship between medication Dolasteron and clinical condition nausea. Such implicit references of the entities and the relationships are common occurrences in daily communication and they add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from Tweets.
This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution that is capable of extracting implicit factual information from text. The developed solution starts by acquiring relevant knowledge to solve the implicit information extraction problem. The relevant knowledge includes domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms such as a text snippet, structured knowledge represented in standard knowledge representation languages like Resource Description Framework (RDF) or custom formats. Hence, the acquired knowledge is processed to create models that can be understood by machines. Such models provide the infrastructure to perform implicit information extraction of interest.
This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in these use cases. They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.
There is a rapid intertwining of sensors and mobile devices into the fabric of our lives. This has resulted in unprecedented growth in the number of observations from the physical and social worlds reported in the cyber world. Sensing and computational components embedded in the physical world is termed as Cyber-Physical System (CPS). Current science of CPS is yet to effectively integrate citizen observations in CPS analysis. We demonstrate the role of citizen observations in CPS and propose a novel approach to perform a holistic analysis of machine and citizen sensor observations. Specifically, we demonstrate the complementary, corroborative, and timely aspects of citizen sensor observations compared to machine sensor observations in Physical-Cyber-Social (PCS) Systems.
Physical processes are inherently complex and embody uncertainties. They manifest as machine and citizen sensor observations in PCS Systems. We propose a generic framework to move from observations to decision-making and actions in PCS systems consisting of: (a) PCS event extraction, (b) PCS event understanding, and (c) PCS action recommendation. We demonstrate the role of Probabilistic Graphical Models (PGMs) as a unified framework to deal with uncertainty, complexity, and dynamism that help translate observations into actions. Data driven approaches alone are not guaranteed to be able to synthesize PGMs reflecting real-world dependencies accurately. To overcome this limitation, we propose to empower PGMs using the declarative domain knowledge. Specifically, we propose four techniques: (a) automatic creation of massive training data for Conditional Random Fields (CRFs) using domain knowledge of entities used in PCS event extraction, (b) Bayesian Network structure refinement using causal knowledge from Concept Net used in PCS event understanding, (c) knowledge-driven piecewise linear approximation of nonlinear time series dynamics using Linear Dynamical Systems (LDS) used in PCS event understanding, and the (d) transforming knowledge of goals and actions into a Markov Decision Process (MDP) model used in PCS action recommendation.
We evaluate the benefits of the proposed techniques on real-world applications involving traffic analytics and Internet of Things (IoT).
Video: https://www.youtube.com/watch?v=ZCToaDgxnAs
Abstract:
People's emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. It has valuable implications for the studies of suicide prevention, employee productivity, well-being of people, customer relationship management, etc. However, emotion identification is quite challenging partly due to the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes the emotion can be devoid of explicit emotion-bearing words, thus the distinction between different emotions can be very subtle, which makes it difficult to glean emotions purely by keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small, which fails to provide a comprehensive coverage of emotion-triggering events and situations.
Understanding users’ latent intents behind search queries is essential for satisfying a user’s search needs. Search intent mining can help search engines to enhance its ranking of search results, enabling new search features like instant answers, personalization, search result diversification, and the recommendation of more relevant ads. Consequently, there has been increasing attention on studying how to effectively mine search intents by analyzing search engine query logs. While state-of-the-art techniques can identify the domain of the queries (e.g. sports, movies, health), identifying domain-specific intent is still an open problem. Among all the topics available on the Internet, health is one of the most important in terms of impact on the user and it is one of the most frequently searched areas. This dissertation presents a knowledge-driven approach for domain-specific search intent mining with a focus on health-related search queries.
First, we identified 14 consumer-oriented health search intent classes based on inputs from focus group studies and based on analyses of popular health websites, literature surveys, and an empirical study of search queries. We defined the problem of classifying millions of health search queries into zero or more intent classes as a multi-label classification problem. Popular machine learning approaches for multi-label classification tasks (namely, problem transformation and algorithm adaptation methods) were not feasible due to the limitation of label data creations and health domain constraints. Another challenge in solving the search intent identification problem was mapping terms used by laymen to medical terms. To address these challenges, we developed a semantics-driven, rule-based search intent mining approach leveraging rich background knowledge encoded in Unified Medical Language System (UMLS) and a crowd sourced encyclopedia (Wikipedia). The approach can identify search intent in a disease-agnostic manner and has been evaluated on three major diseases.
While users often turn to search engines to learn about health conditions, a surprising amount of health information is also shared and consumed via social media, such as public social platforms like Twitter. Although Twitter is an excellent information source, the identification of informative tweets from the deluge of tweets is the major challenge. We used a hybrid approach consisting of supervised machine learning, rule-based classifiers, and biomedical domain knowledge to facilitate the retrieval of relevant and reliable health information shared on Twitter in real time. Furthermore, we extended our search intent mining algorithm to classify health-related tweets into health categories. Finally, we performed a large-scale study to compare health search intents and features that contribute in the expression of search intent from 100+ million search queries from smarts devices (smartphones/tablets) and personal computers (desktops/laptops)
Video of the talk: https://www.youtube.com/watch?v=7k-u_TUew3o
Abstract: Social media has experienced immense growth in recent times. These platforms are becoming increasingly common for information seeking and consumption, and as part of its growing popularity, information overload pose a significant challenge to users. For instance, Twitter alone generates around 500 million tweets per day and it is impractical for users to have to parse through such an enormous stream to find information that are interesting to them. This situation necessitates efficient personalized filtering mechanisms for users to consume relevant, interesting information from social media.
Building a personalized filtering system involves understanding users interests and utilizing these interests to deliver relevant information to users. These tasks primarily include analyzing and processing social media text which is challenging due to its shortness in length, and the real-time nature of the medium. The challenges include: (1) Lack of semantic context: Social Media posts are on an average short in length, which provides limited semantic context to perform textual analysis. This is particularly detrimental for topic identification which is a necessary task for mining users interests; (2) Dynamically changing vocabulary: Most social media websites such as Twitter and Facebook generate posts that are of current (timely) interests to the users. Due to this real-time nature, information relevant to dynamic topics of interest evolve reflecting the changes in the real world. This in turn changes the vocabulary associated with these dynamic topics of interest making it harder to filter relevant information; (3) Scalability: The number of users on social media platforms are significantly large, which is difficult for centralized systems to scale to deliver relevant information to users. This dissertation is devoted to exploring semantic techniques and Semantic Web technologies to address the above mentioned challenges in building a personalized information filtering system for social media. Particularly, the necessary semantics (knowledge) is derived from crowd sourced knowledge bases such as Wikipedia to improve context for understanding short-text and dynamic topics on social media.
Vahid Taslimitehrani's Dissertation Defense: Friday, February 19 2015.
Ph.D. Committee: Drs. Guozhu Dong, Advisor, T.K. Prasad, Amit Sheth, Keke Chen
and Jyotishman Pathak, Division of Health Informatics, Weill Cornell Medical College, Cornell University.
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, called Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC), respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
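The way a pattern-aided model makes a prediction can be illustrated with a minimal sketch. It is not the dissertation's implementation: the names `PatternModel` and `pxr_predict`, the weighted-average combination of matching local models, and the fallback to the baseline when no pattern matches are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Instance = Dict[str, float]


@dataclass
class PatternModel:
    """One (pattern, local model) pair; the weight is a hypothetical field."""
    pattern: Callable[[Instance], bool]       # True if the instance matches
    local_model: Callable[[Instance], float]  # specialized model for that region
    weight: float = 1.0


def pxr_predict(x: Instance,
                baseline: Callable[[Instance], float],
                pairs: List[PatternModel]) -> float:
    """Apply local models only to instances matching their patterns."""
    matches = [p for p in pairs if p.pattern(x)]
    if not matches:
        # Assumption: fall back to the baseline outside all pattern regions.
        return baseline(x)
    total = sum(p.weight for p in matches)
    # Assumption: combine matching local models by weighted average.
    return sum(p.weight * p.local_model(x) for p in matches) / total


# Toy example: the baseline is assumed to fit poorly where a > 10.
baseline = lambda x: 2.0 * x["a"]
pairs = [PatternModel(pattern=lambda x: x["a"] > 10,
                      local_model=lambda x: 3.0 * x["a"] - 5.0)]

low = pxr_predict({"a": 1.0}, baseline, pairs)    # no pattern matches
high = pxr_predict({"a": 20.0}, baseline, pairs)  # local model applies
```

Because each pattern is a readable condition and each local model is simple, the pairs themselves can be inspected, which is where the interpretability claimed above would come from.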
We have conducted a set of comprehensive performance studies to evaluate the performance of CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of modeling, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to handle classifying imbalanced datasets and introduced a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we applied a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
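The abstract does not specify the weighting method used in CPXCim. As a stand-in, the sketch below uses plain inverse class-frequency weighting, a common way to boost minority instances; the function name and the choice of scheme are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> List[float]:
    """Give each instance a weight inversely proportional to its class size.

    Weights are scaled so that every class contributes the same total
    weight and the weights sum to the number of instances.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


# Toy imbalanced dataset: 8 majority instances, 2 minority instances.
labels = ["neg"] * 8 + ["pos"] * 2
weights = inverse_frequency_weights(labels)
```

With these weights a learner that accepts per-instance weights treats the two classes as equally important, even though one has four times as many instances.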
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in those three applications.
1) The document discusses a semantics-based approach to machine perception that uses semantic web technologies to derive abstractions from sensor data using background knowledge on the web.
2) It addresses three primary issues: annotation of sensor data, developing a semantic sensor web, and enabling semantic perception intelligence at the edge on resource-constrained devices.
3) The approach represents background knowledge and sensor observations using ontologies, and uses deductive and abductive reasoning over these representations to interpret sensor data at multiple levels of abstraction.
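The abductive step in item 3 can be sketched as a simple covering check: a candidate explanation is kept only if the background knowledge says it accounts for every observed property. The toy knowledge base, the function name `explain`, and the restriction to single-cause explanations are assumptions for illustration, not the approach described in the summarized document.

```python
from typing import Dict, List, Set


def explain(observations: Set[str],
            knowledge: Dict[str, Set[str]]) -> List[str]:
    """Return the candidate explanations that cover all observations.

    `knowledge` maps each candidate explanation to the set of observable
    properties it is known to cause (hypothetical background knowledge).
    """
    return sorted(cause for cause, props in knowledge.items()
                  if observations <= props)


# Hypothetical background knowledge linking abstractions to observables.
knowledge = {
    "flu": {"fever", "cough"},
    "cold": {"cough", "sneezing"},
    "allergy": {"sneezing"},
}

one_symptom = explain({"cough"}, knowledge)
two_symptoms = explain({"fever", "cough"}, knowledge)
```

Each additional observation narrows the set of candidate explanations, which is how low-level sensor readings could be lifted to a higher-level abstraction.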
The document summarizes Cartic Ramakrishnan's dissertation on extracting semantic metadata from text to facilitate knowledge discovery in biomedicine. It defines knowledge discovery as opportunistic search over an ill-defined space leading to surprising but useful knowledge. It discusses using ontologies and text mining to extract semantic relationships from unstructured text and represent them as structured semantic metadata to enable knowledge exploration and discovery. It presents preliminary work on automating some of Swanson's biomedical discoveries by extracting relationships between concepts from parsed sentences in publications.
This document discusses user-generated content on social media and the challenges and opportunities it presents. It focuses on analyzing content at the micro-level by examining named entities, topics, intentions, and word usage. It presents approaches for identifying cultural entities, determining user intentions, and measuring the complexity of extracting entities from different contexts. Analyzing user intentions through identifying patterns surrounding named entities can help recognize information seeking, sharing, and transactional intents with applications such as targeted online advertising.
The document describes semantic provenance modeling for scientific data and experiments. It discusses developing an upper-level provenance ontology called Provenir to serve as a foundation for domain-specific provenance ontologies. It also covers tracking provenance information for scientific workflows and experiments in a modular, multi-ontology approach.
The document outlines Pablo Mendes' PhD dissertation defense on adaptive semantic annotation of entities and concepts in text. It discusses Pablo Mendes' conceptual model for knowledge base tagging, the DBpedia knowledge base and DBpedia Spotlight system, core evaluations of the system, and case studies applying the system to tweets, audio transcripts, and educational material. The presentation concludes by thanking the audience.
Description - Ajith defended his thesis on application and data portability in cloud
computing. More details on Ajith's research and publications can be
found at http://knoesis.wright.edu/researchers/ajith/
Video can be found at : http://www.youtube.com/watch?v=oDBeBIIFmHc&list=UUORqXk1ZV44MOwpCorAROyQ&index=1&feature=plpp_video
Social media provides a natural platform for dynamic emergence of citizen (as) sensor communities, where the citizens share information, express opinions, and engage in discussions. Often such an Online Citizen Sensor Community (CSC) has stated or implied goals related to workflows of organizational actors with defined roles and responsibilities. For example, a community of crisis response volunteers may inform the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in CSC, there are challenges related to information overload for organizational actors, including finding reliable information providers and finding the actionable information from citizens. This threatens awareness and articulation of workflows to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in both types of messages for asking and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professional) for interpreting user-generated data of citizen sensors. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues in CSC, and to allow better accessibility to user-generated data at a higher level of information abstraction for organizational actors. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes (a) identification of action-related seeking-offering intent behaviors from short, unstructured text documents using a classification model based on both declarative and statistical knowledge, (b) matching of intentions about seeking and offering, and (c) engagement models of users and groups in CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic network connections in the user interaction networks. The results show that fusing top-down knowledge-driven and bottom-up data-driven approaches improves modeling efficiency over conventional bottom-up approaches alone for modeling intent and engagement. Applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement for spreading critical information about prioritized needs, so that citizens donate only the required supplies. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
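The matching step of Identify-Match-Engage can be pictured with a small sketch. It assumes that the identification step has already labeled each message with an intent ("seek" or "offer") and a resource; the `Message` type, the `match_intents` function, and exact matching on the resource string are illustrative assumptions rather than the framework's actual design.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Message(NamedTuple):
    user: str
    intent: str    # "seek" or "offer", assumed to come from a classifier
    resource: str  # e.g. "water", assumed to be already extracted


def match_intents(messages: List[Message]) -> List[Tuple[str, str, str]]:
    """Pair each seeker with every user offering the same resource."""
    offers: Dict[str, List[Message]] = defaultdict(list)
    for m in messages:
        if m.intent == "offer":
            offers[m.resource].append(m)

    pairs = []
    for m in messages:
        if m.intent == "seek":
            for o in offers.get(m.resource, []):
                pairs.append((m.user, o.user, m.resource))
    return pairs


messages = [
    Message("alice", "seek", "water"),
    Message("bob", "offer", "water"),
    Message("carol", "offer", "blankets"),
]
pairs = match_intents(messages)
```

A real system would need fuzzier resource matching and some ranking of candidate pairs, which is where the engagement models described above would come in.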
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, which influenced innovations in diagnosis, treatment, prevention and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches, has serious limitations. ..
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer,
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize much of the benefits promised. If this limitation is left unaddressed, then the LOD Cloud will merely be more data that suffers from the same kinds of problems, which plague the Web of Documents, and hence the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to address the issue of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between classes and properties of ontologies. Between classes, we identify subsumption, equivalence and part-of relationships. Between instances, the work identifies part-of relationships. Between properties, we establish subsumption and equivalence relationships. By bootstrapping we mean the process of utilizing the information contained within the datasets to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence of the feasibility and the applicability of the solution.
The document provides an overview of funding and active projects at Kno.e.sis as of December 2015. Key details include total extramural funds exceeding $8.3 million with the majority obtained that year from competitive NSF and NIH sources. Active projects focus on areas such as context-aware harassment detection on social media, monitoring drug trends on social media, disaster management using social and physical sensing, and modeling social behavior for healthcare utilization in depression. The summary highlights student and faculty involvement and accomplishments across multiple funded projects.
Krishnaprasad Thirunarayan, Trust Management: Multimodal Data Perspective,
Invited Tutorial, The 2015 International Conference on Collaboration
Technologies and Systems (CTS 2015), June 2015
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
By Amit Sheth
Abstract
Kno.e.sis (http://knoesis.org) is a world-class research center that uses semantic, cognitive, and perceptual computing for gathering insights from physical/IoT, cyber/Web, and social and enterprise (e.g., clinical) big data. We innovate and employ semantic web, machine learning, NLP/IR, data mining, network science and highly scalable computing techniques. Our highly interdisciplinary research impacts health and clinical applications, biomedical and translational research, epidemiology, cognitive science, social good, policy, development, etc. A majority of our $12+ million in active funds comes from the NSF and NIH. In this talk, I will provide an overview of some of our major research projects.
Kno.e.sis is highly successful in its primary mission of exceptional student outcomes: our students have exceptional publication and real-world impact and our PhDs compete with their counterparts from top 10 schools for initial jobs in research universities, top industry research labs, and highly competitive companies. A key reason for Kno.e.sis' success is its unique work culture involving teamwork to solve complex problems. Practically all our work involves real-world challenges, real-world data, interdisciplinary collaborators, path-breaking research to solve challenges, real-world deployments, real-world use, and measurable real-world impact.
In this talk, I will also seek to discuss our choice of research topics and our unique ecosystem that prepares our students for exceptional careers.
This tutorial presents tools and techniques for effectively utilizing the Internet of Things (IoT) for building advanced applications, including the Physical-Cyber-Social (PCS) systems. The issues and challenges related to IoT, semantic data modelling, annotation, knowledge representation (e.g. modelling for constrained environments, complexity issues and time/location dependency of data), integration, analysis, and reasoning will be discussed. The tutorial will describe recent developments on creating annotation models and semantic description frameworks for IoT data (e.g. the W3C Semantic Sensor Network ontology). A review of enabling technologies and common scenarios for IoT applications from the data and knowledge engineering point of view will be discussed. Information processing, reasoning, and knowledge extraction, along with existing solutions related to these topics, will be presented. The tutorial summarizes state-of-the-art research and developments on PCS systems, IoT related ontology development, linked data, domain knowledge integration and management, querying large-scale IoT data, and AI applications for automated knowledge extraction from real world data.
Related: Semantic Sensor Web: http://knoesis.org/projects/ssw
Physical-Cyber-Social Computing: http://wiki.knoesis.org/index.php/PCS
Smart Data - How you and I will exploit Big Data for personalized digital hea...
By Amit Sheth
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (e.g., information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, for all the data relevant to my child with the four V-challenges, what I care about is simply, “How is her current health, and what is the risk of having an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques similar to the close interworking of the top brain and the bottom brain in the cognitive models.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
The Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University:
1) Shares the second position globally in impact on the World Wide Web and has the largest academic research group in the US working on semantic web, social media, big data, and health applications.
2) Has exceptional student success with internships and jobs at top companies and a total of 100 researchers including 15 highly cited faculty and 45 PhD students, largely funded through $2M+ annually in research funding.
3) Provides world-class resources for multidisciplinary projects across information technology and domains like biomedicine, with collaboration from industry partners like Google and IBM.
Student Achievement Review (initially presented during Inauguration Function of the Ohio Center of Excellence in Knowledge-Enabled Computing at Wright State (Kno.e.sis)) - updated since
Center overview: http://bit.ly/coe-k
Invitation: http://bit.ly/COE-invite
Context-Aware Harassment Detection on Social Media
is an interdisciplinary project among the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), the Department of Psychology, and the Center for Urban and Public Affairs (CUPA) at Wright State University. The aim of this project is to develop comprehensive and reliable context-aware techniques (using machine learning, text mining, natural language processing, and social network analysis) to glean information about the people involved and their interconnected network of relationships, and to determine and evaluate potential harassment and harassers. An interdisciplinary team of computer scientists, social scientists, urban and public affairs professionals, and educators, together with the participation of college and high school students in the research, will ensure wide impact of scientific research on the support for safe social interactions.
Mining and Analyzing Subjective Experiences in Social Media Text
By Lu Chen
The document outlines Lu Chen's Ph.D. thesis proposal on mining and analyzing subjective experiences from social media text. It defines subjective experience as a quadruple consisting of a holder, stimuli, expression, and classification. It discusses the opportunities to obtain subjective information like sentiment, opinion, emotion, preference, intent, and expectation from social media and how this information can support decision making. Finally, it provides examples of different types of subjective information and how they are characterized.
This document provides an overview of sentiment analysis and discusses why it is an important area of research in language technology. Sentiment analysis involves detecting positive or negative opinions in text about products, politicians, or other topics. It has many applications, such as determining how consumers feel about a new product or predicting election outcomes based on public sentiment. The document also discusses challenges in modeling affective meaning in language at the lexical level in order to perform tasks like sentiment analysis.
The document summarizes Cartic Ramakrishnan's dissertation on extracting semantic metadata from text to facilitate knowledge discovery in biomedicine. It defines knowledge discovery as opportunistic search over an ill-defined space leading to surprising but useful knowledge. It discusses using ontologies and text mining to extract semantic relationships from unstructured text and represent them as structured semantic metadata to enable knowledge exploration and discovery. It presents preliminary work on automating some of Swanson's biomedical discoveries by extracting relationships between concepts from parsed sentences in publications.
This document discusses user-generated content on social media and the challenges and opportunities it presents. It focuses on analyzing content at the micro-level by examining named entities, topics, intentions, and word usage. It presents approaches for identifying cultural entities, determining user intentions, and measuring the complexity of extracting entities from different contexts. Analyzing user intentions through identifying patterns surrounding named entities can help recognize information seeking, sharing, and transactional intents with applications such as targeted online advertising.
The document describes semantic provenance modeling for scientific data and experiments. It discusses developing an upper-level provenance ontology called Provenir to serve as a foundation for domain-specific provenance ontologies. It also covers tracking provenance information for scientific workflows and experiments in a modular, multi-ontology approach.
The document outlines Pablo Mendes' PhD dissertation defense on adaptive semantic annotation of entities and concepts in text. It discusses Pablo Mendes' conceptual model for knowledge base tagging, the DBpedia knowledge base and DBpedia Spotlight system, core evaluations of the system, and case studies applying the system to tweets, audio transcripts, and educational material. The presentation concludes by thanking the audience.
Description - Ajith defended his thesis on application and data portability in cloud
computing. More details on Ajith's research and publications can be
found at http://knoesis.wright.edu/researchers/ajith/
Video can be found at : http://www.youtube.com/watch?v=oDBeBIIFmHc&list=UUORqXk1ZV44MOwpCorAROyQ&index=1&feature=plpp_video
Social media provides a natural platform for dynamic emergence of citizen (as) sensor communities, where the citizens share information, express opinions, and engage in discussions. Often such a Online Citizen Sensor Community (CSC) has stated or implied goals related to workflows of organizational actors with defined roles and responsibilities. For example, a community of crisis response volunteers, for informing the prioritization of responses for resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in CSC, there are challenges related to information overload for organizational actors, including finding reliable information providers and finding the actionable information from citizens. This threatens awareness and articulation of workflows to enable cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in both types of messages for asking and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professional) for interpreting user-generated data of citizen sensors. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues in CSC, and allow better accessibility to user-generated data at higher level of information abstraction for organizational actors. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes a.) 
identification of action related seeking-offering intent behaviors from short, unstructured text documents using both declarative and statistical knowledge based classification model, b.) matching of intentions about seeking and offering, and c.) engagement models of users and groups in CSC to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic network connections in the user interaction networks. The results show an improvement in modeling efficiency from the fusion of top-down knowledge-driven and bottom-up data-driven approaches than from conventional bottom-up approaches alone for modeling intent and engagement. Several applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement for spreading critical information of prioritized needs to ensure donation of only required supplies by the citizens. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, which influenced innovations in diagnosis, treatment, preventions and overall public health. However, much of the existing research on discovering hidden connections among concepts have used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches, has serious limitations. ..
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer,
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize much of the benefits promised. If this limitation is left unaddressed, then the LOD Cloud will merely be more data that suffers from the same kinds of problems, which plague the Web of Documents, and hence the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to address the issue of alignment and relationship identification using a bootstrapping based approach. By alignment we mean the process of determining correspondences between classes and properties of ontologies. We identify subsumption, equivalence and part-of relationship between classes. The work identifies part-of relationship between instances. Between properties we will establish subsumption and equivalence relationship. By bootstrapping we mean the process of being able to utilize the information which is contained within the datasets for improving the data within them. The work showcases use of bootstrapping based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence to the feasibility and the applicability of the solution.
The document provides an overview of funding and active projects at Kno.e.sis as of December 2015. Key details include total extramural funds exceeding $8.3 million with the majority obtained that year from competitive NSF and NIH sources. Active projects focus on areas such as context-aware harassment detection on social media, monitoring drug trends on social media, disaster management using social and physical sensing, and modeling social behavior for healthcare utilization in depression. The summary highlights student and faculty involvement and accomplishments across multiple funded projects.
Krishnaprasad Thirunarayan, Trust Management: Multimodal Data Perspective, Invited Tutorial, The 2015 International Conference on Collaboration Technologies and Systems (CTS 2015), June 2015
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Amit Sheth
Abstract
Kno.e.sis (http://knoesis.org) is a world-class research center that uses semantic, cognitive, and perceptual computing for gathering insights from physical/IoT, cyber/Web, and social and enterprise (e.g., clinical) big data. We innovate and employ semantic web, machine learning, NLP/IR, data mining, network science and highly scalable computing techniques. Our highly interdisciplinary research impacts health and clinical applications, biomedical and translational research, epidemiology, cognitive science, social good, policy, development, etc. A majority of our $12+ million in active funds come from the NSF and NIH. In this talk, I will provide an overview of some of our major research projects.
Kno.e.sis is highly successful in its primary mission of exceptional student outcomes: our students have exceptional publication and real-world impact and our PhDs compete with their counterparts from top 10 schools for initial jobs in research universities, top industry research labs, and highly competitive companies. A key reason for Kno.e.sis' success is its unique work culture involving teamwork to solve complex problems. Practically all our work involves real-world challenges, real-world data, interdisciplinary collaborators, path-breaking research to solve challenges, real-world deployments, real-world use, and measurable real-world impact.
In this talk, I will also seek to discuss our choice of research topics and our unique ecosystem that prepares our students for exceptional careers.
This tutorial presents tools and techniques for effectively utilizing the Internet of Things (IoT) for building advanced applications, including the Physical-Cyber-Social (PCS) systems. The issues and challenges related to IoT, semantic data modelling, annotation, knowledge representation (e.g., modelling for constrained environments, complexity issues and time/location dependency of data), integration, analysis, and reasoning will be discussed. The tutorial will describe recent developments on creating annotation models and semantic description frameworks for IoT data (e.g., the W3C Semantic Sensor Network ontology). Enabling technologies and common scenarios for IoT applications will be reviewed from the data and knowledge engineering point of view. Information processing, reasoning, and knowledge extraction, along with existing solutions related to these topics, will be presented. The tutorial summarizes state-of-the-art research and developments on PCS systems, IoT-related ontology development, linked data, domain knowledge integration and management, querying large-scale IoT data, and AI applications for automated knowledge extraction from real-world data.
Related: Semantic Sensor Web: http://knoesis.org/projects/ssw
Physical-Cyber-Social Computing: http://wiki.knoesis.org/index.php/PCS
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Amit Sheth
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (e.g., information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, for all the data relevant to my child with the four V-challenges, what I care about is simply, “How is her current health, and what is the risk of having an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques similar to the close interworking of the top brain and the bottom brain in the cognitive models.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
The Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University:
1) Shares the second position globally in impact on the World Wide Web and has the largest academic research group in the US working on semantic web, social media, big data, and health applications.
2) Has exceptional student success with internships and jobs at top companies and a total of 100 researchers including 15 highly cited faculty and 45 PhD students, largely funded through $2M+ annually in research funding.
3) Provides world-class resources for multidisciplinary projects across information technology and domains like biomedicine, with collaboration from industry partners like Google and IBM.
Student Achievement Review (initially presented during Inauguration Function of the Ohio Center of Excellence in Knowledge-Enabled Computing at Wright State (Kno.e.sis)) - updated since
Center overview: http://bit.ly/coe-k
Invitation: http://bit.ly/COE-invite
Context-Aware Harassment Detection on Social Media
is an interdisciplinary project among the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), the Department of Psychology, and the Center for Urban and Public Affairs (CUPA) at Wright State University. The aim of this project is to develop comprehensive and reliable context-aware techniques (using machine learning, text mining, natural language processing, and social network analysis) to glean information about the people involved and their interconnected network of relationships, and to determine and evaluate potential harassment and harassers. An interdisciplinary team of computer scientists, social scientists, urban and public affairs professionals, and educators, together with the participation of college and high school students in the research, will ensure wide impact of scientific research on the support for safe social interactions.
Mining and Analyzing Subjective Experiences in Social Media Text
Lu Chen
The document outlines Lu Chen's Ph.D. thesis proposal on mining and analyzing subjective experiences from social media text. It defines subjective experience as a quadruple consisting of a holder, stimuli, expression, and classification. It discusses the opportunities to obtain subjective information like sentiment, opinion, emotion, preference, intent, and expectation from social media and how this information can support decision making. Finally, it provides examples of different types of subjective information and how they are characterized.
This document provides an overview of sentiment analysis and discusses why it is an important area of research in language technology. Sentiment analysis involves detecting positive or negative opinions in text about products, politicians, or other topics. It has many applications, such as determining how consumers feel about a new product or predicting election outcomes based on public sentiment. The document also discusses challenges in modeling affective meaning in language at the lexical level in order to perform tasks like sentiment analysis.
Extracting What We Think and How We Feel from What We Say in Social Media
Lu Chen
This document discusses subjective information extraction from social media text. It begins by outlining a progression from coarse-grained to fine-grained analysis, and from static to dynamic models. It then describes approaches for extracting candidate sentiment expressions, identifying relations between expressions, and assessing target-dependent sentiment polarity. Finally, it provides examples and discusses applications like predicting election results based on analyzing sentiments expressed by different user groups on social media.
Mining and Analyzing Subjective Experiences in User-generated Content
1. Lu Chen
Kno.e.sis Center
Ph.D. Dissertation Defense
Advisor:
Prof. Amit P. Sheth
Committee members:
Prof. T.K. Prasad
Prof. Keke Chen
Dr. Ingmar Weber (QCRI)
Dr. Justin Martineau (SRA)
Ohio Center of Excellence in Knowledge-Enabled Computing
Mining and Analyzing Subjective Experiences in User Generated Content
2. Subjective Experience – What We Experience in Our Mind
Hunger
Love
Happiness
Surprise
Embarrassment
Like
Dislike
Confused
Pain
Tired
Stressed
Nervous
Relaxed
Warm
Proud
Confident
Taste of ice cream
Feeling about sky
Perception of time
Appreciation of music
Opinion on climate change
Interest
Source: http://bit.ly/1DvofHX
Music preference
Purchase intent
3. Subjective Information – The Information about People’s Subjective Experiences
Source: http://bit.ly/1GDD9Mb
Source: http://bit.ly/1KkJF2l
Source: http://bit.ly/1IjjBSX
Source: http://bit.ly/1KkK1Gc
The traditional way of collecting subjective information:
4. User Generated Content
• New opportunities arise as we now can obtain a wide variety of subjective information from user generated content.
5. The Demand of Subjective Information
• Subjective information can be used to support better decision-making.
Source: http://twitris2.knoesis.org/debate
Predicting election results
Source: http://bit.ly/1gQg5Fl
Monitoring social phenomena
Source: http://bit.ly/1niFkU7
Targeted advertising
Source: http://bit.ly/1l0ombo
Making purchase decision
Source: http://bit.ly/1VzYEZG
6. Different Types of Subjective Information
• would like to watch The Secret Life Of Pets. I hope it's good.
  Intent: “would like to watch”
  Expectation: “hope it’s good”
• "The Secret Life of Pets" was clever, adorable, funny and I already want to see it again.
  Sentiment: “clever, adorable, funny”
  Intent: “want to see it again”
• I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me.
  Opinion: “don’t think watching … makes me childish”
  Emotion: “I laughed I cried and it was so touching”
• Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though.
  Preference: “much better than”
  Preference: “not as good as”
• The Secret Life of Pets soundtrack should be nominated for an Oscar
  Opinion: “should be nominated for an Oscar”
7. Defining Subjective Information
Formally, a subjective experience can be represented as a quadruple (𝒉, 𝒔, 𝒆, 𝒄):
𝒉 − a holder, an individual who holds the experience.
𝒔 − a stimulus (or target), an entity, event or situation that elicits the experience.
𝒆 − a set of expressions that are used to describe the experience, e.g., the sentiment words/phrases or the opinion claims.
𝒄 − a classification or assessment that categorizes or measures the experience, e.g., sentiment orientation (positive vs. negative), emotion type (joy, anger, sadness, surprise, etc.), or a score indicating the strength of sentiment.
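As a minimal sketch, the quadruple can be written down as a small Python record. The class name `SubjectiveExperience`, its field names, and the holder value `"author"` are illustrative assumptions rather than anything defined in the dissertation; the example values follow the movie message used elsewhere in these slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubjectiveExperience:
    """Hypothetical container for the quadruple (h, s, e, c)."""
    holder: str                   # h: the individual who holds the experience
    stimulus: str                 # s: entity, event or situation that elicits it
    expressions: Tuple[str, ...]  # e: words/phrases that describe it
    classification: str           # c: category or score that assesses it


review = SubjectiveExperience(
    holder="author",  # assumption: the message author is the holder
    stimulus="The Secret Life of Pets",
    expressions=("clever", "adorable", "funny"),
    classification="positive",
)
print(review.stimulus, "->", review.classification)
```

Keeping the four components as separate fields mirrors the way the slides treat identification of each component as its own extraction task.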
8. Different Types of Subjective Information
| Type | Holder 𝒉 | Stimulus 𝒔 | Expression 𝒆 | Classification 𝒄 |
| --- | --- | --- | --- | --- |
| Sentiment | an individual who holds the sentiment | an entity | sentiment words/phrases | positive, negative, neutral |
| Opinion | an individual who holds the opinion | an entity | opinion claims (may not contain sentiment words) | positive, negative, neutral |
| Emotion | an individual who holds the emotion | an event or situation | emotion words/phrases, description of events/situations | anger, disgust, fear, happiness, sadness, surprise |
| Preference | an individual who holds the preference | a set of alternatives | words/phrases that indicate comparison or preference | depends on specific tasks |
| Intent | an individual who holds the intent | an action | words/phrases that show the presence of will, description of the act | depends on specific tasks |
| Expectation | an individual who holds the expectation | an entity | words/phrases that express beliefs about what someone or something will be | depends on specific tasks |
9.
| Example | Type | Stimulus 𝒔 | Expression 𝒆 | Classification 𝒄 |
| --- | --- | --- | --- | --- |
| would like to watch The Secret Life Of Pets. I hope it's good. | Intent | watch the movie | “would like to” | transactional |
| | Expectation | The Secret Life of Pets movie | “hope” | optimistic |
| "The Secret Life of Pets" was clever, adorable, funny and I already want to see it again. | Sentiment | The Secret Life of Pets movie | “clever”, “funny”, “adorable” | positive |
| | Intent | see the movie | “want to” | transactional |
| I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me. | Opinion | The Secret Life of Pets movie | “don’t think … makes me childish” | positive |
| | Emotion | The Secret Life of Pets movie | “laughed”, “cried”, “so touching” | funny, touching |
| Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though. | Preference | Finding Dory, The Secret Life of Pets | “much better than” | preferring Finding Dory |
| | Preference | Finding Dory, Zootopia | “not as good as” | preferring Zootopia |
| The Secret Life of Pets soundtrack should be nominated for an Oscar | Opinion | The Secret Life of Pets soundtrack | “should be nominated for an Oscar” | positive |
* The holders of these experiences are the authors of the messages.
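To make the link between expressions and types concrete, here is a toy cue-phrase lookup in Python. It is only an illustration under stated assumptions: the `CUES` dictionary and the `tag_types` function are hypothetical, the cue phrases are hand-picked from the examples above, and simple substring matching is not the extraction approach developed in the dissertation.

```python
# Hand-picked cue phrases taken from the examples above (illustrative only).
CUES = {
    "would like to": "Intent",
    "want to": "Intent",
    "hope": "Expectation",
    "much better than": "Preference",
    "not as good as": "Preference",
}


def tag_types(message: str) -> list:
    """Return the sorted set of types whose cue phrases occur in the message."""
    text = message.lower()
    return sorted({label for cue, label in CUES.items() if cue in text})


print(tag_types("would like to watch The Secret Life Of Pets. I hope it's good."))
print(tag_types("Finding Dory was much better than The Secret Life of Pets. "
                "Still not as good as Zootopia though."))
```

A fixed lookup like this breaks down quickly on informal text, which is part of why the slides frame expression extraction as a task in its own right.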
10.
[Figure: An overview of subjective information extraction. The box colored in orange indicates the scope of this dissertation.]
11. Dissertation Focus
• Identifying and extracting subjective information from user generated content.
1. Extraction of Target-Specific Sentiment Expressions (ICWSM’12)
2. Discovery of Domain-Specific Features and Aspects (NAACL’16)
Emotion Identification (SocialCom’12, BII’12, CSCW’14, ACL’14)
3. Application: Predicting Election Results (SocInfo’12)
4. Application: Religiosity & Happiness (SocInfo’14)
[Figure labels: Sentiment, Opinion, Emotion; Subjective Information; Holder 𝒉; Stimuli 𝒔; Expression 𝒆; Classification 𝒄]
12. Thesis Statement
• This dissertation presents a unified framework that characterizes a
subjective experience, such as sentiment, opinion, or emotion, in terms
of an individual holding it, a target eliciting it, a set of expressions
describing it, and a classification or assessment measuring it;
• it describes new algorithms that automatically identify and extract
sentiment expressions and opinion targets from user generated content
with minimal human supervision;
• it shows how to use social media data to predict election results and
investigate religion and subjective well-being, by classifying and
assessing subjective information in user generated content.
13. Sentiment in User Generated Content
Sources: Social media
Data: posts, messages
Targets: movies, persons, brands, etc.
E1. Lights out definitely lived up to the hype! Great movie!
E2. I got my second Pikachu today this one was from 2k egg revitalised my
love for Pokemon go... Did not last long 😆 stoopid game
E3. Game of Thrones is a must watch.
E4. I find myself grateful that Hillary Clinton is predictable and steady. Like
her or don't, she's SAFE.
E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
E6. I saw The Avengers yesterday evening. It was long but it was very good!
E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life
xD
Target
Lights out 75% 20% 5%
Pokemon Go 69% 17% 14%
Game of Thrones 83% 10% 7%
Hillary Clinton 49% 35% 16%
The Avengers 70% 24% 6%
Galaxy S7 Edge 68% 16% 16%
Sentiment Analysis → Predictive Models: business analytics, predicting financial performance, predicting election results, …
14. 1. Extraction of Target-Specific Sentiment Expressions
Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment
Expressions with Target-dependent Polarity from Twitter. Proceedings of the 6th International AAAI
Conference on Weblogs and Social Media (ICWSM), 2012.
Given a set of unlabeled social media posts, how can we extract diverse forms of sentiment expressions with respect to a specific target?
15. Example
E1. Lights out definitely lived up to the hype! Great movie!
E2. I got my second Pikachu today this one was from 2k egg revitalised my
love for Pokemon go... Did not last long 😆 stoopid game
E3. Game of Thrones is a must watch.
E4. I find myself grateful that Hillary Clinton is predictable and steady. Like
her or don't, she's SAFE.
E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
E6. I saw The Avengers yesterday evening. It was long but it was very good!
E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life
xD
Instances | Sentiment Expressions | Classification
E1 | lived up to the hype, great | positive
E2 | love, not last long, stoopid | positive, negative
E3 | must watch | positive
E4 | grateful, predictable, steady, safe | positive
E5 | mad overrated, cheesy, horrible, very predictable | negative
E6 | long, very good | negative, positive
E7 | last so long, unlimited | positive
Sources: Social media
Data: posts, messages, e.g., tweets
Targets: movies, persons, brands, etc.
16. Challenges
• Sentiment expressions can be very diverse.
‒ Vary from single words (e.g., “good”, “predictable”) to multi-word phrases
of different lengths (“lived up to the hype”, “must see”)
‒ Can be formal or slang expressions, including abbreviations and spelling
variations (e.g., “gud”, “stoopid”).
• The polarity of a sentiment expression is sensitive to its target.
‒ E.g., “long” in “long river”, “long battery life”, or “long time for
downloading”.
‒ E.g., “predictable” regarding movies, or regarding stocks.
17. Contributions
We propose a novel optimization-based approach that:
• identifies a richer and more diverse set of sentiment expressions,
including both formal and slang words/phrases;
• assesses the target-dependent polarity of each sentiment
expression; and
• does not require labeled data or hand-crafted patterns.
19. Extracting Candidate Expressions
Method:
• For each message, select the “on-target” subjective words, and extract all the n-grams that contain at least one selected subjective word as candidates.
• A subjective word is selected as “on-target” if
(1) there is a dependency relation between the word and the target, or
(2) the word is proximate to the target (e.g., within a distance of four words).
Example:
“The Avengers movie was bloody amazing! A little cheesy at times, but I liked it. Mmm looking good Robert Downey Jr and Captain America ;)”
“On-target” subjective words: “bloody”, “amazing”, “cheesy”, “liked”
Candidate expressions: “bloody”, “amazing”, “bloody amazing”, “cheesy”, “little cheesy”, “cheesy at times”, “little cheesy at times”, “liked”
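The candidate extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses only the proximity rule (the dependency-relation rule would need a parser), and the function names, the token-position representation of the target, and the `max_n` cap are assumptions. The slide's example also shows fewer candidates than plain n-gram enumeration yields, so the original method presumably applies further filtering that is not modeled here.

```python
from typing import Iterable, List, Set


def on_target_words(tokens: List[str], target_positions: Iterable[int],
                    subjective_lexicon: Set[str], max_distance: int = 4) -> List[int]:
    """Indices of subjective tokens within max_distance tokens of the target."""
    targets = list(target_positions)
    selected = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in subjective_lexicon:
            continue
        if any(abs(i - t) <= max_distance for t in targets):
            selected.append(i)
    return selected


def candidate_ngrams(tokens: List[str], selected: Iterable[int], max_n: int = 4) -> Set[str]:
    """All n-grams (n <= max_n) containing at least one selected token index."""
    chosen = set(selected)
    candidates = set()
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            end = start + n
            if end > len(tokens):
                break
            if chosen.intersection(range(start, end)):
                candidates.add(" ".join(tokens[start:end]))
    return candidates


tokens = ["the", "avengers", "was", "bloody", "amazing"]
selected = on_target_words(tokens, target_positions=[1],
                           subjective_lexicon={"bloody", "amazing"})
candidates = candidate_ngrams(tokens, selected, max_n=2)
```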
20. Identifying Inter-Expression Relations
1. I saw The Avengers yesterday evening. It was long but it was very good!
2. I do enjoy The Avengers, but it's both overrated and problematic.
3. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
4. The avengers was good but the plot was just simple minded and predictable.
5. The Avengers was good. I was not disappointed.
22. An Optimization Model (1)
• For each candidate expression $c_i$,
‒ P-Probability $\Pr^{P}(c_i)$ – the probability that $c_i$ indicates positive sentiment
‒ N-Probability $\Pr^{N}(c_i)$ – the probability that $c_i$ indicates negative sentiment
‒ $\Pr^{P}(c_i) + \Pr^{N}(c_i) = 1$
• For each pair of candidate expressions $c_i$ and $c_j$,
‒ Consistency probability – the probability that $c_i$ and $c_j$ have the same polarity:
$\Pr^{cons}(c_i, c_j) = \Pr^{P}(c_i)\Pr^{P}(c_j) + \Pr^{N}(c_i)\Pr^{N}(c_j)$
‒ Inconsistency probability – the probability that $c_i$ and $c_j$ have different polarities:
$\Pr^{incons}(c_i, c_j) = \Pr^{P}(c_i)\Pr^{N}(c_j) + \Pr^{N}(c_i)\Pr^{P}(c_j)$
23. An Optimization Model (2)
• We want the consistency and inconsistency probabilities derived from the P-Probabilities and N-Probabilities of the candidates to be closest to their expectations suggested by the relation networks.
• Objective Function:
$$\text{minimize} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left[ w^{cons}_{ij} \bigl(1 - \Pr^{cons}(c_i, c_j)\bigr)^2 + w^{incons}_{ij} \bigl(1 - \Pr^{incons}(c_i, c_j)\bigr)^2 \right]$$
where $w^{cons}_{ij}$ and $w^{incons}_{ij}$ are the weights of the edges (strength of the relations) between $c_i$ and $c_j$ in the consistency and inconsistency relation networks, and $n$ is the total number of candidate expressions, with
$\Pr^{cons}(c_i, c_j) = \Pr^{P}(c_i)\Pr^{P}(c_j) + \Pr^{N}(c_i)\Pr^{N}(c_j)$
$\Pr^{incons}(c_i, c_j) = \Pr^{P}(c_i)\Pr^{N}(c_j) + \Pr^{N}(c_i)\Pr^{P}(c_j)$
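To make the objective concrete, the sketch below evaluates it for a given assignment of P-Probabilities. It only computes the objective value; the solver used in the dissertation is not shown on the slides and is not reproduced here. The function names and the nested-list weight matrices are assumptions, and each N-Probability is taken as one minus the P-Probability, following the constraint that the two sum to 1.

```python
from typing import List


def consistency(p_i: float, p_j: float) -> float:
    """Probability that two candidates share the same polarity."""
    return p_i * p_j + (1 - p_i) * (1 - p_j)


def inconsistency(p_i: float, p_j: float) -> float:
    """Probability that two candidates have different polarities."""
    return p_i * (1 - p_j) + (1 - p_i) * p_j


def objective(p: List[float], w_cons: List[List[float]],
              w_incons: List[List[float]]) -> float:
    """Weighted squared gap between derived probabilities and the relation networks.

    p[i] is the P-Probability of candidate i; only entries with j > i
    of the weight matrices are read.
    """
    n = len(p)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += w_cons[i][j] * (1 - consistency(p[i], p[j])) ** 2
            total += w_incons[i][j] * (1 - inconsistency(p[i], p[j])) ** 2
    return total


# Toy networks: candidates 0 and 1 are linked as consistent, 0 and 2 as inconsistent.
w_cons = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
w_incons = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
uninformed = objective([0.5, 0.5, 0.5], w_cons, w_incons)
agreeing = objective([1.0, 1.0, 0.0], w_cons, w_incons)
```

An assignment that agrees with both relations drives the objective to zero, while the uninformed assignment (all 0.5) is penalized on every weighted edge.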
24. Experiments: Datasets
Table: Description of four target-specific datasets from social media.
Tweet about movie: New Star Trek movie is great! Highly recommend it!
Tweet about person: Scarlett Johansson rocking a suit better than most men.
Forum post about epilepsy treatment: I have an 11 month old who suffers from 0-8 seizures per day. We've tried 6 medications that have all failed and are now on The Ketogenic Diet. The diet has been amazing at reducing the frequency and intensity of his seizures. However, I want them GONE! I am wondering if infant chiropractic care or acupuncture is safe and effective in eliminating seizures. Does anyone have any experience with either of these?
Forum post about cellular company: I click on Mobile Sync to move all my contacts from my phone to the Sprint website. There are over 100 contacts in my phone, but it's only moving 59 of them? Help
Facebook post about automobile company: I have a 2006 Trailblazer that had a motor failure at 60,000 miles. GM refused to help in any way. Poor customer service to say the least. I guess they don't care about your car post warranty. With a driveway full of GM's its probably the last one I will buy.
25. Experiments on Tweets
• Datasets:
‒ 168,005 tweets about movies
‒ 258,655 tweets about persons
• Gold standard: 1500 tweets were randomly sampled from each domain.
Human experts identified sentiment expressions and labeled each
expression and tweet with target-specific sentiment.
Table: Distributions of N-grams and Part-of-speech of the Sentiment Expressions in the Gold Standard Data Set.
Table: Distribution of Sentiment Categories of the Tweets in the Gold Standard Data Set.
26. Methods
COM -- Constrained Optimization Model
• COM-const: Assign 0.5 to all the candidates as their initial P-Probabilities.
• COM-gelex: Initialize the candidates’ polarities according to the subjectivity dictionary (positive: 1.0, negative: 0.0, other: 0.5).
• MPQA, GI, SWN: For each extracted subjective word regarding the target, simply look up its polarity in MPQA, General Inquirer and SentiWordNet, respectively.
• PROP: a propagation approach proposed by Qiu et al. (IJCAI’09)
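The two COM initializations can be sketched in a few lines. The function name and the dictionary-based lexicon format are assumptions made for this illustration; only the numeric values (0.5, 1.0, 0.0) come from the slide.

```python
from typing import Dict, Iterable, Optional


def initial_p_probabilities(candidates: Iterable[str],
                            lexicon: Optional[Dict[str, str]] = None) -> Dict[str, float]:
    """Initial P-Probabilities: COM-const if lexicon is None, otherwise COM-gelex.

    lexicon maps an expression to "positive" or "negative" (hypothetical format).
    """
    prior = {"positive": 1.0, "negative": 0.0}
    init = {}
    for c in candidates:
        label = lexicon.get(c) if lexicon is not None else None
        init[c] = prior.get(label, 0.5)  # unknown or unlisted polarity starts at 0.5
    return init


candidates = ["good", "cheesy", "long"]
com_const = initial_p_probabilities(candidates)
com_gelex = initial_p_probabilities(candidates, {"good": "positive", "cheesy": "negative"})
```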
27. Results
The results demonstrate the advantage of our optimization-based approach over lexicon-based and rule-based approaches to polarity assessment: our method extracts diverse sentiment expressions and captures their target-dependent polarity.
28. Results of Sentiment Expression Extraction with Various Corpora Sizes
Both the precision and the recall of our approach increase when the size of the corpus grows from 12,000 to 48,000, because the method benefits from the additional relations extracted from larger corpora.
29. Experiments on Other Social Media Posts
• Datasets:
‒ 100 forum posts about epilepsy treatment
‒ 162 forum posts about cellular company
‒ 200 Facebook posts about automobile company
• Gold standard: human experts identified sentiment expressions from posts, and labeled each expression and post sentence with target-specific sentiment.
Table: Characteristics of sentiment expressions in the Gold Standard Data Set.
Table: Distribution of Sentiment Categories of post sentences in the Gold Standard Data Set.
30. Results
Table: Quality of the extracted sentiment expressions.
Figure: Sentence-level sentiment classification accuracy using different lexicons.
The stable performance on all five datasets provides a strong indication that
the proposed approach is not limited to a specific domain or a specific social
media data source.
32. Aspect-based Opinion Mining
It would be helpful to have aspect-based opinion summaries for products.
Size: big screen perfect size fits big bedroom …
picture quality: full hd best picture blur reduction …
motion-smoothing: smooth motion sensor tracing effects …
sound quality: loud white noise high pitched sound …
…
33. 2. Discovery of Domain-Specific Features and Aspects
Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of
Aspects and Features from Reviews. Proceedings of the 15th Annual Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL),
2016.
Given a set of plain product reviews, how to efficiently
identify (both explicit and implicit) product features
and group them into aspects?
34. Example
Review Sentences
1. Phone is easy to use and has great features. Large
screen is great. Great speed makes smooth
viewing of tv programs or sports.
2. It has a big bright display, it's very fast and very
lightweight for its size.
3. Good features for an inexpensive android, light,
good signal, good sound, pretty quick for a
800MHz processor.
4. The phone runs extra fast and smooth, and has
great price.
Aspects
{screen, display, bright}
{size, large, big}
{lightweight, light}
{price, inexpensive}
{speed, processor, fast, quick, smooth}
{easy, use}
{features}
{signal}
{sound}
Feature: components and attributes of a product.
• Explicit feature: mentioned as an opinion target
• Implicit feature: implied by opinion words
• Different feature expressions may be used to describe the same aspect of a product.
Aspect: represented as a group of features
35. Related Work
• Two-step approach: first identifying features, then clustering them
• Feature Identification
‒ Only extract features but not group them.
‒ Implicit features have been largely ignored.
‒ Require seed terms, hand-crafted rules/patterns, or other annotation efforts.
• Feature Clustering/Aspect Discovery
‒ Assume that features have been identified beforehand.
‒ Topic-model based approach
o not fine-grained aspects (Zhang and Liu, 2014), not directly interpretable as aspects (Chen et al., 2013; Bancken et al., 2014), not good at dealing with aspect sparsity (Xu et al., 2014), etc.
‒ Clustering-based approach (Su et al., 2008; Lu et al., 2009; Bancken et al., 2014)
36. Contributions
We propose a new clustering-based approach that:
• identifies both features and aspects simultaneously;
• extracts both explicit and implicit features and groups them into
aspects; and
• does not require seed terms, hand-crafted patterns, or any other
labeling efforts.
37. The Clustering Algorithm
Notation
• {x1, …, xn} is a set of candidate features, which are extracted from reviews of a given product.
o Candidates for explicit features: nouns and noun phrases
o Candidates for implicit features: adjectives and verbs
• k is the number of aspects.
• s is the number of most frequent candidates that will be grouped first to generate the seed clusters.
• δ is the upper bound of the distance between two mergeable clusters.
Grouping the s most frequent candidates first serves two purposes:
(1) To generate high quality seed clusters: frequent terms are more likely to be the actual features that customers care about.
(2) To speed up the process by clustering only the most frequent ones.
Domain-specific similarity measure: determines how similar the members of two clusters are regarding the particular domain/product.
Merging constraints: further ensure that terms from different aspects are not merged.
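The slide names the ingredients of the algorithm (k, s, δ, seed clusters, a similarity measure, merging constraints) but does not spell out the procedure, so the following is only a loose sketch under stated assumptions: average-linkage agglomerative merging of the s most frequent candidates until k clusters remain or the closest pair is farther apart than δ, followed by attaching each remaining candidate to its nearest cluster if it lies within δ. The merging constraints are omitted, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List


def cluster_distance(a: List[str], b: List[str],
                     sim: Callable[[str, str], float]) -> float:
    """Average-linkage distance: 1 minus the mean pairwise similarity (an assumption)."""
    total = sum(sim(x, y) for x in a for y in b)
    return 1.0 - total / (len(a) * len(b))


def seed_then_assign(candidates: List[str], freq: Dict[str, int],
                     sim: Callable[[str, str], float],
                     k: int, s: int, delta: float) -> List[List[str]]:
    ranked = sorted(candidates, key=lambda x: freq[x], reverse=True)
    clusters = [[x] for x in ranked[:s]]
    # Step 1: merge the closest pair of seed clusters while more than k remain.
    while len(clusters) > max(k, 1):
        pairs = combinations(range(len(clusters)), 2)
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], sim))
        if cluster_distance(clusters[i], clusters[j], sim) > delta:
            break  # no pair is mergeable under the distance upper bound
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    # Step 2: attach each remaining candidate to its nearest cluster, if close enough.
    for x in ranked[s:]:
        best = min(clusters, key=lambda c: cluster_distance([x], c, sim))
        if cluster_distance([x], best, sim) <= delta:
            best.append(x)
    return clusters


# Toy similarity table; values are made up for illustration.
table = {frozenset(("screen", "display")): 0.9,
         frozenset(("price", "inexpensive")): 0.8,
         frozenset(("bright", "screen")): 0.7,
         frozenset(("bright", "display")): 0.7}


def toy_sim(a: str, b: str) -> float:
    return 1.0 if a == b else table.get(frozenset((a, b)), 0.1)


freq = {"screen": 10, "display": 9, "price": 8, "inexpensive": 7, "bright": 1}
aspects = seed_then_assign(list(freq), freq, toy_sim, k=2, s=4, delta=0.5)
```

Candidates that are not within δ of any cluster are simply left out in this sketch, which is one plausible way to treat terms that are not features.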
38. Similarity Measures
• General semantic similarities that are learned from thesaurus dictionaries or web corpus.
‒ The similarities between words/phrases are domain dependent.
E.g., “ice cream sandwich” and “operating system” (cell-phone domain)
“smooth” and “speed” (cell-phone domain vs. hair dryer domain)
• Domain-dependent similarities that are learned from a domain-specific corpus based on distributional information.
‒ Different aspects may share similar context.
E.g., “great display”, “great price”, “great speed”
‒ The words describing the same aspect may not share similar context or co-occur.
E.g., people use “is inexpensive” or “has great price” instead of “has inexpensive price”; “running fast” or “great speed” instead of “fast speed”
39. Domain-specific Similarity
• General similarity matrix G -- an n × n matrix, where Gij is the general semantic similarity between xi and xj, Gij ∈ [0, 1], Gij = 1 when i = j, and Gij = Gji.
• Use UMBC Semantic Similarity Service to get G.
• Statistical association matrix T -- an n × n matrix, where Tij is the pairwise statistical association between xi and xj in a domain-specific corpus, Tij ∈ [0, 1], Tij = 1 when i = j, and Tij = Tji.
• Use normalized pointwise mutual information (NPMI) to get T, where
- f(xi) (or f(xj)) is the number of documents where xi (or xj) appears,
- f(xi, xj) is the number of documents where xi and xj co-occur in a sentence,
- N is the total number of documents in the corpus.
NPMI(xi, xj) ∈ [−1, 1], and we rescale the values of NPMI to the range of [0, 1].
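The NPMI formula itself did not survive in this transcript. The sketch below uses the standard definition, NPMI = PMI / (−log p(xi, xj)), with probabilities estimated from the document counts defined above; whether the dissertation estimates them in exactly this way is an assumption. The linear rescaling (v + 1) / 2 is likewise one plausible way to map [−1, 1] onto [0, 1], not necessarily the one used by the authors.

```python
import math


def npmi(f_i: int, f_j: int, f_ij: int, n_docs: int) -> float:
    """Standard normalized PMI from document counts, in [-1, 1]."""
    if f_ij == 0:
        return -1.0  # the two terms never co-occur
    p_i, p_j, p_ij = f_i / n_docs, f_j / n_docs, f_ij / n_docs
    if p_ij == 1.0:
        return 1.0  # degenerate case: both terms in every document (convention chosen here)
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / (-math.log(p_ij))


def rescale(value: float) -> float:
    """Map [-1, 1] linearly onto [0, 1] (assumed rescaling)."""
    return (value + 1.0) / 2.0


always_together = npmi(10, 10, 10, 100)   # terms that only ever appear together
independent = npmi(10, 10, 1, 100)        # co-occurrence at chance level
```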
40. Domain-specific Similarity
• A candidate xi can be represented by the i-th row in G or T.
• The domain-specific similarity between xi and xj is defined as the weighted sum of the similarity metrics simg, simt, and simgt, where
‒ simg captures semantically similar/relevant words, e.g., “screen” and “display”, “speed” and “fast”.
‒ simt captures words sharing similar context, e.g., “ice cream sandwich” and “operating system”.
‒ simgt gets high value when the terms strongly associated with xi (or xj) are semantically similar to xj (or xi), e.g., “smooth” and “speed”.
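The defining formulas for the three metrics are missing from this transcript. The sketch below shows one way the weighted sum could be computed, assuming each metric is a cosine similarity between row vectors: simg between rows of G, simt between rows of T, and simgt as the larger of the two cross comparisons between a G row and a T row. These cosine definitions are assumptions; only the weighted-sum structure and the default weights (wg = wt = 0.2, wgt = 0.6, from the experimental setting) come from the slides.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def domain_similarity(i: int, j: int, G: List[List[float]], T: List[List[float]],
                      w_g: float = 0.2, w_t: float = 0.2, w_gt: float = 0.6) -> float:
    """Weighted sum of three similarity metrics (metric definitions are assumed)."""
    sim_g = cosine(G[i], G[j])                             # general semantic similarity
    sim_t = cosine(T[i], T[j])                             # shared-context similarity
    sim_gt = max(cosine(G[i], T[j]), cosine(T[i], G[j]))   # cross comparison
    return w_g * sim_g + w_t * sim_t + w_gt * sim_gt


# Toy matrices: two candidates that are unrelated in both G and T.
G = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [0.0, 1.0]]
same = domain_similarity(0, 0, G, T)
different = domain_similarity(0, 1, G, T)
```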
41. Data and Experimental Setting
• We evaluate this approach on reviews from three different domains.
• The default setting of CAFE (Clustering for Aspect and Feature Extraction):
‒ The number of aspects k = 50
‒ Distance upper bound δ = 0.8
‒ The number of candidates that are grouped first to generate seed clusters s = 500
‒ The weights of three similarity measures wg = wt = 0.2, wgt = 0.6
42. Evaluations on Feature Extraction – Methods
• PROP: A double propagation approach that extracts features using hand-crafted rules based on dependency relations between features and opinion words. (Qiu et al., IJCAI’09)
• LRTBOOT: A bootstrapping approach that extracts features by mining pairwise feature-feature, feature-opinion, opinion-opinion associations between terms in the corpus, where the association is measured by the likelihood ratio tests. (Hai et al., CIKM’12)
44. Evaluations on Aspect Discovery – Methods
• MuReinf: A clustering method that utilizes the mutual reinforcement association between features and opinion words to iteratively group them into feature clusters and opinion clusters. (Su et al., WWW’08)
• L-EM: A semi-supervised learning method that adapts Naive Bayesian-based EM algorithm to group synonym features into categories. (Zhai et al., WSDM’11)
• L-LDA: This is a baseline method used in (Zhai et al., WSDM’11), which is based on LDA.
* Because MuReinf, L-EM and L-LDA need another algorithm to extract features, both LRTBOOT and CAFE are applied.
45. Evaluations on Aspect Discovery – Results
The results showed the advantage of combining feature and aspect discovery
over chaining them, and also implied the effectiveness of our domain-specific
similarity measure in identifying synonym features in a particular domain.
46. Influence of Parameters
Based on the experiments on three domains, the best results can be achieved when the distance upper bound δ is set to a value between 0.76 and 0.84.
CAFE generates better results by first clustering the top 10%-30% most frequent candidates.
The best F-score and Rand Index can be achieved when we set wgt to 0.5 or 0.6 across all three domains.
48. 3. Harnessing Public Opinion on Twitter to Predict Election Results
Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User
Groups in Predicting 2012 U.S. Republican Presidential Primaries. Proceedings of the 4th International
Conference on Social Informatics (SocInfo) 2012.
How to derive public opinion about election candidates?
Are opinion holders equal in predicting elections?
49. Overview
Figure: for each user (tweets and network), the pipeline produces
• a record per tweet: Tweet ID; candidate: XXX; opinion: positive
• a user category along four dimensions: 1. Political Preference (e.g., right-leaning), 2. Engagement Degree (e.g., high engagement), 3. Content Type (e.g., opinion prone), 4. Tweet Mode (e.g., orig. tweet-prone)
Steps: predicting which candidate this user supports; aggregating opinions of each user group to predict election results.
50. Contributions
• We introduce a new method to predict the election results that:
‒ identifies which candidate is mentioned, and whether a positive or
negative opinion is expressed towards a candidate in a tweet;
‒ predicts which candidate a user supports based on the opinions extracted
from his/her tweets; and
‒ aggregates the opinions of all users from a group to predict which
candidate will win the election.
• We show that the opinion holders matter in predicting election
results.
‒ We group users based on their political preference, engagement degree,
tweet mode, and content type, and examine the predictive power of
different user groups in predicting Super Tuesday results in 10 states.
‒ We evaluate the results in terms of both the accuracy of predicting
winners and the error rate between the predicted votes and the actual
votes for each candidate.
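The three prediction steps can be illustrated with a small sketch. The scoring rule used here (positive mentions minus negative mentions, with ties left undecided) is an assumption for illustration only; the slides do not state how a user's supported candidate is derived from tweet-level opinions. The function names and the candidate labels "A" and "B" are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

TweetOpinion = Tuple[str, str]  # (candidate mentioned, "positive" or "negative")


def predict_user_vote(tweet_opinions: List[TweetOpinion]) -> Optional[str]:
    """Pick the candidate with the highest net opinion score; None on a tie or no data."""
    scores: Dict[str, int] = defaultdict(int)
    for candidate, polarity in tweet_opinions:
        scores[candidate] += 1 if polarity == "positive" else -1
    if not scores:
        return None
    best = max(scores.values())
    winners = [c for c, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


def predict_vote_shares(users: Dict[str, List[TweetOpinion]]) -> Dict[str, float]:
    """Aggregate the predicted votes of all users in a group into vote shares."""
    votes = Counter(v for v in map(predict_user_vote, users.values()) if v is not None)
    total = sum(votes.values())
    return {c: n / total for c, n in votes.items()} if total else {}


group = {
    "u1": [("A", "positive"), ("B", "negative")],
    "u2": [("A", "positive")],
    "u3": [("B", "positive")],
    "u4": [("A", "positive"), ("B", "positive")],  # tie: no vote predicted
}
shares = predict_vote_shares(group)
```

Running the same aggregation separately for each user group (for example, right-leaning users only) gives the group-level predictions that the study compares.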
51. Findings
• The results reveal the challenge of identifying the opinion of the “silent majority”.
• Retweets may not necessarily reflect users' attitude.
• Prediction of a user’s vote based on more opinion tweets is not necessarily more accurate than the prediction using more information tweets.
• The right-leaning user group provides the most accurate prediction result. In the best case (56-day time window), it correctly predicts the winners in 8 out of 10 states with an average prediction error of 0.1.
52. 4. Religion and Subjective Well-being
Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of
the 6th International Conference on Social Informatics (SocInfo), 2014.
Lu Chen, Ingmar Weber, Adam Okulicz-Kozaryn, and Amit Sheth. Understanding the Effect of Religion
on Happiness by Examining the Topic Preferences and Word Usage on Twitter. (in submission to PLOS
ONE).
How to use Twitter data to measure subjective well-being? How does the religious belief of users (holders) affect their happiness expressed in tweets?
53. Overview
User level (a user with tweets and network; user ID; e.g., user’s religious belief: Buddhism):
• happiness_level: h_avg(user)
• topic_preference: p(topic | user)
• word_preference: p(word | topic, user)
Group level (e.g., Religion: Buddhism), obtained by aggregating the measures of individual users:
• happiness_level: h_avg(group)
• topic_preference: p(topic | group)
• word_preference: p(word | topic, group)
1. What is the effect of religion on happiness?
2. How do topic preference and word usage affect the happiness expressed by each group?
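The aggregation from user-level to group-level measures could look like the sketch below. It assumes a plain unweighted mean over users, which the slide does not confirm, and the dictionary layout and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_group(users: List[dict]) -> Tuple[float, Dict[str, float]]:
    """Average per-user happiness and topic distributions over a non-empty group.

    Each user is a dict with "happiness" (float) and "topics"
    (mapping topic -> p(topic | user)); this layout is illustrative only.
    """
    n = len(users)
    happiness = sum(u["happiness"] for u in users) / n
    topic_sum: Dict[str, float] = defaultdict(float)
    for u in users:
        for topic, p in u["topics"].items():
            topic_sum[topic] += p
    topic_preference = {t: total / n for t, total in topic_sum.items()}
    return happiness, topic_preference


group_users = [
    {"happiness": 6.0, "topics": {"family": 0.5, "work": 0.5}},
    {"happiness": 5.0, "topics": {"family": 1.0}},
]
group_happiness, group_topics = aggregate_group(group_users)
```

Averaging per-user distributions, rather than pooling all tweets, keeps highly active users from dominating the group-level measure.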
54. Contributions
• We provide a fresh perspective about happiness and religion,
complementing traditional survey-based studies, via analyzing the
topics and words naturally disclosed in people's social media messages.
• We introduce a framework and methodology that explore the effect of
social and demographic factors of a holder (e.g., a holder’s religious
belief) on subjective well-being.
• Our method also explores potential reasons for the variations in the
level of happiness from the holder’s topic preferences and word usage
on topics.
55. Findings
• There is a significant difference among the seven groups (atheist,
Buddhist, Christian, Hindu, Jew, Muslim, and random Twitter users) on
the level of happiness (pleasant/unpleasant emotions) expressed in
tweets.
• Each user group has different topic preferences and different word
usage on the same topic. However, differences on word usage are small
compared with the differences on topic distributions.
• The users' topic preferences strongly correlate with their happiness
expressed in tweets.
56. Conclusion
• This dissertation presents a unified framework that characterizes a
subjective experience, such as sentiment, opinion, or emotion, in terms
of an individual holding it, a target eliciting it, a set of expressions
describing it, and a classification or assessment measuring it;
• it describes new algorithms that automatically identify and extract
sentiment expressions and opinion targets from user generated content
with minimal human supervision;
• it shows how to use social media data to predict election results and
investigate religion and subjective well-being, by classifying and
assessing subjective information in user generated content.
57. Future Directions
1. Detecting different types of subjectivity in text
2. Beyond sentiment and opinion
3. Towards dynamic modeling of subjective information.
A subjective experience is a quintuple (h, s, e, c, t), where t is the time when the subjective experience occurs.
Figure axis label: Time
58. Publications
• Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects and Features from
Reviews. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL), 2016. (Acceptance rate: 24%)
• Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of the 6th International
Conference on Social Informatics (SocInfo), 2014. (Acceptance rate: 23%)
• Justin Martineau, Lu Chen, Doreen Cheng and Amit Sheth. Active Learning with Efficient Feature Weighting Methods for Improving
Data Quality and Classification Accuracy. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
(ACL), 2014. (Acceptance rate: 26%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. Proceedings of the 17th ACM
Conference on Computer Supported Cooperative Work and Social Computing (CSCW) 2014. (Acceptance rate: 27%)
• Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Alan Smith, and Wenbo Wang. Chapter title: Twitris - A
System for Collective Social Intelligence. Encyclopedia of Social Network Analysis and Mining, 2014.
• D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck. PREDOSE: A
Semantic Web Platform for Drug Abuse Epidemiology Using Social Media. Journal of Biomedical Informatics: Special Issue on
Biomedical Information through the Implementation of Social Media Environments. 2013. PMID: 23892295.
• Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012
U.S. Republican Presidential Primaries. Proceedings of the 4th International Conference on Social Informatics (SocInfo) 2012.
(Acceptance rate: 35%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter "Big Data" for Automatic Emotion
Identification. Proceedings of the 4th ASE/IEEE International Conference on Social Computing (SocialCom), 2012.
• Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment Expressions with Target-
dependent Polarity from Twitter. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM),
2012. (Acceptance rate: 20%)
• Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical
Informatics Insights (BII), 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, and A. Sheth. "I Just Wanted to Tell You That Loperamide WILL
WORK": A Web-Based Study of Extra-Medical Use of Loperamide. Journal of Drug and Alcohol Dependence, 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Udayanga, L. Chen, A. Sheth. A Web-based Study of Self-treatment of Opioid
Withdrawal Symptoms with Loperamide. The College on Problems of Drug Dependence (CPDD), 2012.
62. Acknowledgement
Dissertation Committee: Prof. Amit Sheth (Advisor), Prof. T.K.Prasad, Prof. Keke Chen, Dr. Ingmar Weber (QCRI), Dr. Justin Martineau (SRA)
Co-authors and Collaborators: Dr. Shaojun Wang (Computer Science), Dr. Meena Nagarajan (IBM Watson), Prof. Adam Okulicz-Kozaryn (Rutgers-Camden), Dr. Wenbo Wang (GoDaddy), Dr. Doreen Cheng (SRA), Prof. Raminta Daniulaityte, Dr. Delroy Cameron (Apple), Dr. Ming Tan (IBM Watson), Prof. Valerie Shalin
64. Acknowledgement
This dissertation is based upon work supported by the National Science Foundation under Grants:
• IIS-1111182 “SoCS: Collaborative Research: Social Media Enhanced Organizational Sensemaking in Emergency Response” and
• CNS-1513721 “Context-Aware Harassment Detection on Social Media.”
Editor's Notes
Note that the precision may be worse than the true quality obtainable using a larger corpus, since the gold standards are generated from a subset of tweets.