This paper presents a framework to automatically detect disinformation campaigns on social media. The framework integrates natural language processing, machine learning, graph analysis, and causal inference. It collects social media posts, identifies narratives, classifies accounts as real or influence operations, maps the network of accounts spreading each narrative, and estimates the causal impact of each account in spreading the narrative. The framework was tested on real Twitter data from the 2017 French election and detected influence operation accounts with 96% precision and 79% recall. It also identified communities and high impact accounts that were corroborated by other sources.
Available Data Science M.Sc. Thesis Proposals (Marco Brambilla)
This document discusses 7 problems related to human-generated content and quantitative analysis. The first problem involves knowledge extraction from data, information, and knowledge. The second problem examines analyzing social spaces using data from services like Foursquare and Flickr. The third problem looks at analyzing social aspects of software development using data from GitHub and developer communities. The fourth problem is applying machine learning to computational social science topics like politics, debates, and societal issues. The fifth problem relates to using knowledge bases to help with content understanding tasks. The sixth problem discusses using digital humanities to engage citizens in envisioning future policies and visions. The seventh problem covers developing data models to help with mobility services.
Iterative knowledge extraction from social networks. The Web Conference 2018 (Marco Brambilla)
Knowledge in the world continuously evolves, and ontologies are largely incomplete, especially regarding data belonging to the so-called long tail. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds.
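The seed-centroid ranking step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: the function names, the term-frequency representation, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector for a snippet of social content."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average the seed vectors into a single centroid vector."""
    c = Counter()
    for v in vectors:
        c.update(v)
    return Counter({t: n / len(vectors) for t, n in c.items()})

def rank_candidates(seed_texts, candidate_texts, top_k=2):
    """Rank candidate entities by similarity of their social content
    to the centroid of the expert-provided seeds."""
    seed_centroid = centroid([tf_vector(t) for t in seed_texts])
    scored = [(cosine(tf_vector(t), seed_centroid), name)
              for name, t in candidate_texts.items()]
    return [name for _, name in sorted(scored, reverse=True)][:top_k]
```

The top-ranked candidates can then be fed back in as new seeds, giving the iterative behaviour the abstract describes.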
In this paper we address the following research questions: (1) How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds? (2) How does the reconstructed domain knowledge spread geographically? (3) Can the method be used to inspect the past, present, and future of knowledge? (4) Can the method be used to find emerging knowledge?
This work was presented at The Web Conference 2018, MSM workshop.
This document proposes using a convolutional neural network (CNN) to detect and classify fake news. It first discusses the implications of fake news spreading on social media and the need for automated identification. It then explores existing fake news datasets and data preprocessing techniques. Deep learning approaches like word embeddings and CNNs are presented as promising techniques to capture semantics in text for classification. The document outlines a CNN architecture with word embedding, convolutional, max pooling, and fully connected layers to output probabilities for fake/real classification. It reports the CNN approach achieved 99.8% accuracy on a 2.5GB dataset, significantly outperforming baseline models like SVM and naive Bayes. Finally, contact information is provided for questions.
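The embedding → convolution → max-pooling → dense pipeline that the document outlines can be sketched without a deep learning framework. This is a toy forward pass only (no training), and every name and parameter value here is an illustrative assumption, not taken from the document:

```python
import math
import random

EMB_DIM, FILTER_WIDTH = 8, 3  # assumed toy sizes

def embed(tokens, table, rng):
    """Look up (or lazily create) an embedding vector per token."""
    return [table.setdefault(t, [rng.uniform(-1, 1) for _ in range(EMB_DIM)])
            for t in tokens]

def conv_relu(embs, filters):
    """Slide each filter over windows of FILTER_WIDTH consecutive embeddings."""
    maps = []
    for f in filters:
        fmap = []
        for i in range(len(embs) - FILTER_WIDTH + 1):
            window = [x for e in embs[i:i + FILTER_WIDTH] for x in e]
            fmap.append(max(0.0, sum(w * x for w, x in zip(f, window))))  # ReLU
        maps.append(fmap)
    return maps

def max_pool(maps):
    """Keep only the strongest activation of each feature map."""
    return [max(m) if m else 0.0 for m in maps]

def fake_probability(tokens, table, filters, dense, rng):
    """Embedding -> convolution -> max pooling -> dense layer -> sigmoid."""
    pooled = max_pool(conv_relu(embed(tokens, table, rng), filters))
    z = sum(w * x for w, x in zip(dense, pooled))
    return 1.0 / (1.0 + math.exp(-z))
```

In a real system the embeddings, filters, and dense weights would be learned by backpropagation; here they only demonstrate how the layers compose.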
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (ijnlc)
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we compared recurrent neural networks against naive Bayes and random forest classifiers using five groups of linguistic features. Evaluated on the "real or fake" dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or longer phrases, best indicate the authenticity of news.
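The bigram feature extraction behind the best-performing model is simple to sketch. As a stand-in for the paper's random forest, the toy classifier below is a multinomial naive Bayes over bigrams (one of the baselines the paper also evaluates); the class name, data, and smoothing are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def bigrams(text):
    """Extract word-pair features, e.g. 'secret cure' -> 'secret_cure'."""
    toks = text.lower().split()
    return [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

class BigramNB:
    """Multinomial naive Bayes over bigram features, with add-one smoothing."""
    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.priors = Counter(labels)
        for t, y in zip(texts, labels):
            self.counts[y].update(bigrams(t))
        self.vocab = {b for c in self.counts.values() for b in c}
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        for y in self.priors:
            lp = math.log(self.priors[y] / sum(self.priors.values()))
            total = sum(self.counts[y].values()) + len(self.vocab)
            for b in bigrams(text):
                lp += math.log((self.counts[y][b] + 1) / total)
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```

Swapping this classifier for a random forest only changes the model fed with the same bigram count features.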
IRJET- Fake News Detection and Rumour Source Identification (IRJET Journal)
This document discusses methods for detecting fake news and identifying the source of rumors on social media. It proposes using Bayesian classification to sort information into real or fake categories; if the combined outputs from the classes do not match, the information is considered fake. It also discusses a reverse dissemination strategy that identifies a group of suspects for the original rumor source rather than examining each individual node, which addresses the difficulty of source identification. The method aims to identify the source node based on which nodes have accepted the rumor. Machine learning and natural language processing techniques are used to detect fake news from article content.
Who’s in the Gang? Revealing Coordinating Communities in Social Media (Derek Weber)
Political astroturfing and organised trolling are online malicious behaviours with significant real-world effects. Common approaches examining these phenomena focus on broad campaigns rather than the small groups responsible. To reveal networks of cooperating accounts, we propose a novel temporal window approach that relies on account interactions and metadata alone. It detects groups of accounts engaging in behaviours that, in concert, execute different goal-based strategies, which we describe. Our approach is validated against two relevant datasets with ground truth data. See https://github.com/weberdc/find_hccs for code and data.
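The temporal window idea can be sketched as follows: bucket interactions into fixed windows and flag account pairs that repeatedly act on the same target inside the same window. This is a minimal sketch of the general approach, not the authors' code (see their repository for that); the function name, event format, and thresholds are assumptions:

```python
from collections import defaultdict
from itertools import combinations

def coordinating_pairs(events, window=60, min_cooccurrences=2):
    """events: (timestamp, account, target) tuples, e.g. a retweet of a URL.
    Accounts that act on the same target inside the same time window,
    repeatedly across windows, become candidate coordinating pairs."""
    buckets = defaultdict(set)
    for ts, account, target in events:
        buckets[(ts // window, target)].add(account)
    pair_counts = defaultdict(int)
    for accounts in buckets.values():
        for pair in combinations(sorted(accounts), 2):
            pair_counts[pair] += 1
    return {p for p, n in pair_counts.items() if n >= min_cooccurrences}
```

Tightening `window` or raising `min_cooccurrences` trades recall for precision; the resulting pair set induces the coordination network whose communities are then analysed.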
Presented at ASONAM'20 (2020 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining).
Co-authored with Frank Neumann (University of Adelaide)
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF... (IJCSEA Journal)
In this digital era, social media is an important tool for information dissemination, and Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions; this information, when properly interpreted, provides valuable insights into domains such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised technique for discovering hidden patterns in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to coronavirus discussions. The study compares regular LDA with LDA based on collapsed Gibbs sampling (LDAMallet), and the experiments use different data processing steps: with and without trigrams, and with and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve topic inference, yielding a better coherence score.
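The hashtag pooling scheme that drives the study's main result is straightforward: concatenate all tweets sharing a hashtag into one longer pseudo-document before running LDA, since LDA struggles on very short texts. A minimal sketch (the function name and fallback key are assumptions, not from the paper):

```python
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Pool short tweets into longer pseudo-documents keyed by hashtag.
    Tweets without any hashtag fall into a shared catch-all pool."""
    pools = defaultdict(list)
    for tweet in tweets:
        tags = [w.lower() for w in tweet.split() if w.startswith("#")]
        for tag in tags or ["#_untagged"]:
            pools[tag].append(tweet)
    return {tag: " ".join(ts) for tag, ts in pools.items()}
```

The resulting pseudo-documents, rather than the raw tweets, are what get passed to the LDA (or LDAMallet) trainer.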
No misunderstandings during Earthquakes (ISCRAM 2015)
The researchers conducted a study to develop a standardized tweet format for Italy's national earthquake monitoring institute (INGV) to use when automatically detecting earthquakes on Twitter. A survey of over 1,000 people found that they wanted timely earthquake information from INGV on Twitter. Based on the results, the researchers proposed a new tweet format including provisional magnitude, location, and time in local format. The standardized format was approved by INGV to help effectively communicate automatic earthquake detection information on Twitter.
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ... (Sameera Horawalavithana)
The document discusses research into quantifying the relationship between a graph's properties and its vulnerability to deanonymization attacks. It presents three research questions: 1) How topological properties affect attacks, 2) How node attribute placement affects vulnerability, and 3) How diffusion processes impact vulnerability. The methodology section outlines generating synthetic and real-world graphs, modeling attacks, and measuring success. Key findings include some topological properties like transitivity and assortativity impacting privacy independent of degree distribution. Node attribute diversity increases vulnerability more than attribute homophily. Faster spreading diffusions see higher vulnerability growth. The implications are discussed for data owners and privacy researchers.
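Transitivity, one of the topological properties the study links to deanonymization risk, measures how often two neighbours of a node are themselves connected. A short sketch of computing it on a toy undirected graph (the adjacency-dict representation is an assumption for illustration):

```python
from itertools import combinations

def transitivity(adj):
    """Global transitivity of an undirected graph given as an adjacency
    dict {node: set_of_neighbours}: closed triples / all connected triples.
    Each triangle is counted once per centre node, which yields the usual
    3 * triangles / triples ratio."""
    triangles = triples = 0
    for node, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in adj[a]:
                triangles += 1
    return triangles / triples if triples else 0.0
```

A triangle graph scores 1.0 and a simple path scores 0.0, bracketing the range within which real social graphs fall.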
Research @ the Social Media Lab (@SMLabTO) (Philip Mai)
The Social Media Lab at Ryerson University studies how social media can support online communities and political engagement. It develops data analytics software like Netlytic and MyTweeps to analyze social media conversations and visualize social networks. The lab has expertise in social network analysis, information visualization, and computer-mediated communication. It has a single director and includes postdocs, PhD students, master's students, undergraduates, and faculty collaborators from various institutions.
A method to evaluate the reliability of social media data for social network ... (Derek Weber)
In order to study the effects of Online Social Network (OSN) activity on real-world offline events, researchers need access to OSN data, the reliability of which has particular implications for social network analysis. This relates not only to the completeness of any collected dataset, but also to constructing meaningful social and information networks from them. In this multidisciplinary study, we consider the question of constructing traditional social networks from OSN data and then present a measurement case study showing how the reliability of OSN data affects social network analyses. To this end we developed a systematic comparison methodology, which we applied to two parallel datasets we collected from Twitter. We found considerable differences in datasets collected with different tools and that these variations significantly alter the results of subsequent analyses. Our results lead to a set of guidelines for researchers planning to collect online data streams to infer social networks.
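The core of such a comparison methodology can be sketched as measuring (a) how much two parallel collections overlap and (b) whether a downstream analysis result survives the difference. This minimal sketch is not the authors' methodology; the metric choices, function names, and toy edge data are assumptions:

```python
from collections import Counter

def jaccard(a, b):
    """Set overlap of two collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def top_account(edges):
    """Most-retweeted account in a list of (source, target) edges."""
    return Counter(dst for _, dst in edges).most_common(1)[0][0]

def compare_collections(edges_a, edges_b):
    """Compare two parallel collections of the same event: how much the
    retweet-edge sets overlap, and whether both datasets agree on the
    most-retweeted account (a simple downstream analysis result)."""
    return {
        "edge_jaccard": jaccard(edges_a, edges_b),
        "same_top_account": top_account(edges_a) == top_account(edges_b),
    }
```

Even with only 50% edge overlap, a coarse result like the top account may agree across tools while finer network measures diverge, which is the kind of variation the guidelines address.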
Presented at ASONAM'20 (2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining)
Co-authors: Mehwish Nasim (Data61 / CSIRO), Lewis Mitchell (University of Adelaide), Lucia Falzon (University of Melbourne / DST Group)
Myths and challenges in knowledge extraction and analysis from human-generate... (Marco Brambilla)
For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted. The challenge is once again to capture and create knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.
The talk explores this mixed realistic-utopian domain of knowledge extraction and reports on tools and cases where the digital and physical worlds have been brought together to better understand our society.
Big data analysis of news and social media content (Firas Husseini)
This document summarizes research from the Intelligent Systems Laboratory at the University of Bristol on analyzing large amounts of news and social media content using computational methods. It discusses several studies, including analyzing over 400 million tweets to track public mood in the UK, extracting narrative networks from over 125,000 news articles about the 2012 US elections, comparing differences between news outlets and their topics/writing styles using machine learning, modeling the EU news media network using clustering and translation techniques, and predicting popular news articles based on their content. The research demonstrates how computational social science can reveal patterns in large datasets that were previously impossible to analyze at scale.
This document provides an initial literature review and research proposal for a project analyzing social media data from Twitter to predict crime rates. The research aims to identify evidence of criminal activity by analyzing geotagged tweets and visualizing crime hotspots. Challenges include issues with jurisdiction when crimes span multiple countries and limitations of current UK law in addressing online crimes. The methodology will use both traditional legal research and interdisciplinary approaches. The literature review discusses prior research on crime prediction methods and limitations, as well as the potential for social media data to analyze crimes.
Hate speech has been an ongoing problem on the Internet for many years, and social media platforms, especially Facebook and Twitter, have given it a global stage where it can spread far more rapidly. Every social media platform needs an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech: rule-based, machine learning based, deep learning based, and hybrid. As this is a review paper, we summarize the valuable work of the various authors who have studied the identification of hate speech using these approaches.
This document summarizes a research study analyzing the content of tweets sent by politicians from several countries including Korea, the UK, US and Canada. The researchers gathered Twitter data for the politicians using an API tool and conducted a preliminary content analysis of tweets from Korean politicians. They categorized the tweets into three types: socio-political messages about events and policies, private personal messages, and conversational messages intended for others using the "@" feature. The results of their content analysis on 878 tweets from Korean politicians found statistically significant differences in the frequencies of these three tweet types.
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING (ijaia)
Unsupervised machine learning techniques such as clustering are gaining wide use with the recent growth of social communication platforms like Twitter and Facebook, since clustering enables the discovery of patterns in unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset and compared the performance of nine clustering algorithms on it. We evaluated the generalizability of these algorithms using a supervised learning model and, using a selected unsupervised learning algorithm, categorized the clusters. The top five categories are Safety, Crime, Products, Countries, and Health. This can prove helpful for organizations that use large amounts of Twitter data and need to quickly find key points in the data before going into further classification.
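Clustering short tweets can be illustrated with a one-pass "leader" scheme: each tweet joins the first cluster whose leader it overlaps with strongly enough, else it starts a new cluster. This is a cheap stand-in for the nine algorithms the paper compares, with assumed function names, threshold, and toy data:

```python
def token_set(text):
    """Crude tokenizer: lowercase words with hashtag/punctuation stripped."""
    return {w.lower().strip("#,.") for w in text.split()}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def leader_cluster(tweets, threshold=0.3):
    """One-pass leader clustering over token-set similarity.
    clusters: list of (leader_token_set, member_tweets) pairs."""
    clusters = []
    for t in tweets:
        toks = token_set(t)
        for leader, members in clusters:
            if jaccard(leader, toks) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((toks, [t]))
    return [members for _, members in clusters]
```

Each resulting cluster would then be labelled (Safety, Crime, etc.) by inspecting its most frequent terms, which is the categorization step the abstract describes.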
Social media produces vast amounts of user data that could potentially be used for official statistics. However, most individual social media messages are "noisy" and not directly relevant. Researchers at CBS are developing methods to filter relevant messages from different social media platforms about topics like sentiment, social tensions, and migration intentions. They are also working to characterize social media users and account for differences from the general population in order to use these sources to produce new real-time indicators.
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE (cscpconf)
The anonymity of social networks makes them attractive to those who use hate speech to mask their criminal activities online, posing a challenge to the world and in particular to Ethiopia. With this ever-increasing volume of social media data, identifying hate speech that aggravates conflict between citizens of nations becomes a challenge, and the high rate of production makes it difficult to collect, store, and analyze such big data using traditional detection methods. This paper proposes the application of Apache Spark to hate speech detection to reduce these challenges. The authors developed an Apache Spark based model to classify Amharic Facebook posts and comments into hate and not hate, employing random forest and naive Bayes for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83% accuracy. The proposed method achieves a promising result, exploiting the unique features of Spark for big data.
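TF-IDF, one of the two feature schemes the paper compares against Word2Vec, is easy to sketch outside Spark. A minimal version for a small corpus (the function name and weighting variant are illustrative assumptions; Spark's own `ml.feature` pipeline would be used at scale):

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weight dicts over a small corpus.
    tf = term count / doc length; idf = log(N / document frequency)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                        for t in tf})
    return vectors
```

A term appearing in every document gets weight 0, so classifiers see only the terms that discriminate between hate and not-hate posts.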
ISI 2017 presentation on Big Data and bias (Piet J.H. Daas)
1) The document discusses three ways of using big data in statistics: (1) combined with survey data, (2) from a single complete source, and (3) from a single incomplete source.
2) Examples of type 2 include road sensor traffic data and web-scraped price data. These sources completely cover their target populations.
3) Examples of type 3 include social media data and mobile phone data. Only part of the target population is included, so ways must be found to deal with the missing part, such as determining the characteristics of the included population.
Dealing with Information Overload When Using Social Media for Emergency Manag... (Mirjam-Mona)
Presentation of Starr Roxanne Hiltz and Linda P. Plotnick on the topic "Dealing with Information Overload When Using Social Media for Emergency Management: Emerging Solutions" at ISCRAM2013
Rapid identification of new drugs through online monitoring tools: The case o... (Australian Drug Foundation)
The rapid proliferation of new drugs available to Australians has necessitated the use of innovative techniques to monitor their emergence. In this presentation, Monica uses the example of NBOMe drugs (reportedly sold as ‘legal LSD’) to outline 4 ways of monitoring drug use trends online and in real-time (Google Trends, drug user forums, Twitter, and Silk Road). These tools are freely available for use by clinicians, AOD workers and researchers who are seeking further information about new drugs presented by clients, or that are talked about in their work.
On How the Darknet and its Access to SCADA is a Threat to National Critical I... (Matthew Kurnava)
This document analyzes how the darknet poses a threat to national critical infrastructure. It begins with an introduction that defines the darknet and describes some of the illegal activities that occur there. The research question asks how the darknet threatens critical infrastructure and how vulnerable different sectors are. The hypothesis is that the darknet poses a primary threat to US cyber critical infrastructure due to criminal, hacktivist, and terrorist use that could significantly damage health and welfare. A literature review discusses research on darknet cyber attacks, hacktivist and terrorist groups using the darknet, and critical infrastructure's growing dependency on technology and vulnerability. The methodology will use an analytical approach to examine threats to each of the 16 US critical infrastructure sectors.
Recently, fake news has been causing many problems for our society, and many researchers have been working on identifying it. Most fake news detection systems utilize the linguistic features of the news; however, they have difficulty sensing highly ambiguous fake news, which can be detected only after identifying its meaning and the latest related information. In this paper, to resolve this problem, we present a new Korean fake news detection system using a fact DB that is built and updated by direct human judgement after collecting obvious facts. Our system receives a proposition and searches the fact DB for semantically related articles in order to verify whether the given proposition is true by comparing it with those related articles. To achieve this, we utilize a deep learning model, Bidirectional Multi-Perspective Matching for Natural Language Sentences (BiMPM), which has demonstrated good performance on the sentence matching task. However, BiMPM has some limitations: the longer the input sentence, the lower its performance, and it has difficulty making an accurate judgement when an unlearned word or an unlearned relation between words appears. To overcome these limitations, we propose a new matching technique which exploits article abstraction as well as an entity matching set in addition to BiMPM. In our experiments, we show that our system improves overall performance for fake news detection.
Prasanth. K | Praveen. N | Vijay. S | Auxilia Osvin Nancy. V, "Fake News Detection using Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020,
URL: https://www.ijtsrd.com/papers/ijtsrd30014.pdf
Paper Url : https://www.ijtsrd.com/engineering/information-technology/30014/fake-news-detection-using-machine-learning/prasanth-k
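The retrieval step of the Korean fact-DB system above — finding semantically related articles before the BiMPM comparison — can be sketched with simple term overlap. This is a crude stand-in for whatever semantic search the system actually uses; the function name and toy fact DB are assumptions:

```python
def retrieve_related(proposition, fact_db, top_k=1):
    """Rank fact-DB articles by the number of terms they share with the
    proposition, returning the top_k candidates for detailed matching."""
    terms = set(proposition.lower().split())
    scored = sorted(fact_db,
                    key=lambda art: len(terms & set(art.lower().split())),
                    reverse=True)
    return scored[:top_k]
```

The retrieved articles would then be passed, together with the proposition, to the sentence-matching model for the actual true/false judgement.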
The document outlines key aspects of research methodology including:
1. The objectives of research such as defining problems, formulating hypotheses, collecting and evaluating data, making deductions, and testing conclusions.
2. The different types of research including descriptive, applied, quantitative, conceptual, empirical, qualitative, fundamental, and analytical research.
3. The methods of collecting data including primary methods like questionnaires, observations, interviews, and schedules and secondary methods of collecting published and unpublished data from various sources.
Civic Exchange - 2009 The Air We Breathe Conference - Application of Studies ... (Civic Exchange)
Civic Exchange 2009 The Air We Breathe Conference - Experts Symposium 9 January 2009
Application of Studies on Health Effects of Air Pollution in Hong Kong
presented by Dr CM Wong (Department of Community Medicine and School of Public Health, The University of Hong Kong)
http://air.dialogue.org.hk
This document discusses different types of pollution including air, water, noise, land, and radioactive pollution. It defines each type of pollution, discusses their causes and effects, and provides suggestions for prevention. The document shares information through text and images to raise awareness about various forms of pollution and their impacts on the environment.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...IJCSEA Journal
In this digital era, social media is an important tool for information dissemination. Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly perceived provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet) algorithms. The experiments use different data processing steps including trigrams, without trigrams, hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve the topic inference results with a better coherence score.
No misunderstandings during EarthquakesISCRAM 2015
The researchers conducted a study to develop a standardized tweet format for Italy's national earthquake monitoring institute (INGV) to use when automatically detecting earthquakes on Twitter. A survey of over 1,000 people found that they wanted timely earthquake information from INGV on Twitter. Based on the results, the researchers proposed a new tweet format including provisional magnitude, location, and time in local format. The standardized format was approved by INGV to help effectively communicate automatic earthquake detection information on Twitter.
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Sameera Horawalavithana
The document discusses research into quantifying the relationship between a graph's properties and its vulnerability to deanonymization attacks. It presents three research questions: 1) How topological properties affect attacks, 2) How node attribute placement affects vulnerability, and 3) How diffusion processes impact vulnerability. The methodology section outlines generating synthetic and real-world graphs, modeling attacks, and measuring success. Key findings include some topological properties like transitivity and assortativity impacting privacy independent of degree distribution. Node attribute diversity increases vulnerability more than attribute homophily. Faster spreading diffusions see higher vulnerability growth. The implications are discussed for data owners and privacy researchers.
Research @ the Social Media Lab (@SMLabTO) (Philip Mai)
The Social Media Lab at Ryerson University studies how social media can support online communities and political engagement. It develops data analytics software like Netlytic and MyTweeps to analyze social media conversations and visualize social networks. The lab has expertise in social network analysis, information visualization, and computer-mediated communication. It is directed by one director and includes postdocs, PhD students, master's students, undergraduates, and faculty collaborators from various institutions.
A method to evaluate the reliability of social media data for social network ... (Derek Weber)
In order to study the effects of Online Social Network (OSN) activity on real-world offline events, researchers need access to OSN data, the reliability of which has particular implications for social network analysis. This relates not only to the completeness of any collected dataset, but also to constructing meaningful social and information networks from them. In this multidisciplinary study, we consider the question of constructing traditional social networks from OSN data and then present a measurement case study showing how the reliability of OSN data affects social network analyses. To this end we developed a systematic comparison methodology, which we applied to two parallel datasets we collected from Twitter. We found considerable differences in datasets collected with different tools and that these variations significantly alter the results of subsequent analyses. Our results lead to a set of guidelines for researchers planning to collect online data streams to infer social networks.
Presented at ASONAM'20 (2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining)
Co-authors: Mehwish Nasim (Data61 / CSIRO), Lewis Mitchell (University of Adelaide), Lucia Falzon (University of Melbourne / DST Group)
Myths and challenges in knowledge extraction and analysis from human-generate... (Marco Brambilla)
For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradictory, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted. The challenge is once again to capture and create knowledge that is new, has not yet been formalized in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like the semantic web, machine learning, statistics, NLP, and so on) suffice for this objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.
The talk explores the mixed realistic-utopian domain of knowledge extraction and reports on some tools and cases where the digital and physical worlds have been brought together for a better understanding of our society.
Big data analysis of news and social media content (Firas Husseini)
This document summarizes research from the Intelligent Systems Laboratory at the University of Bristol on analyzing large amounts of news and social media content using computational methods. It discusses several studies, including analyzing over 400 million tweets to track public mood in the UK, extracting narrative networks from over 125,000 news articles about the 2012 US elections, comparing differences between news outlets and their topics/writing styles using machine learning, modeling the EU news media network using clustering and translation techniques, and predicting popular news articles based on their content. The research demonstrates how computational social science can reveal patterns in large datasets that were previously impossible to analyze at scale.
This document provides an initial literature review and research proposal for a project analyzing social media data from Twitter to predict crime rates. The research aims to identify evidence of criminal activity by analyzing geotagged tweets and visualizing crime hotspots. Challenges include issues with jurisdiction when crimes span multiple countries and limitations of current UK law in addressing online crimes. The methodology will use both traditional legal research and interdisciplinary approaches. The literature review discusses prior research on crime prediction methods and limitations, as well as the potential for social media data to analyze crimes.
Hate speech has been an ongoing problem on the Internet for many years, and social media platforms, especially Facebook and Twitter, have given it a global stage where it can spread far more rapidly. Every social media platform needs an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech: rule-based, machine learning based, deep learning based, and hybrid. As this is a review paper, we survey the work of the many authors who have studied hate speech identification using these approaches.
This document summarizes a research study analyzing the content of tweets sent by politicians from several countries including Korea, the UK, US and Canada. The researchers gathered Twitter data for the politicians using an API tool and conducted a preliminary content analysis of tweets from Korean politicians. They categorized the tweets into three types: socio-political messages about events and policies, private personal messages, and conversational messages intended for others using the "@" feature. The results of their content analysis on 878 tweets from Korean politicians found statistically significant differences in the frequencies of these three tweet types.
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING (ijaia)
Unsupervised machine learning techniques such as clustering are gaining wide use with the recent growth of social communication platforms like Twitter and Facebook. Clustering enables the finding of patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset. We compared the performance of nine clustering algorithms on this dataset and evaluated their generalizability using a supervised learning model. Finally, using a selected unsupervised learning algorithm, we categorized the clusters. The top five categories are Safety, Crime, Products, Countries and Health. This can prove helpful for bodies that use large amounts of Twitter data and need to quickly find key points in the data before going into further classification.
Social media produces vast amounts of user data that could potentially be used for official statistics. However, most individual social media messages are "noisy" and not directly relevant. Researchers at CBS are developing methods to filter relevant messages from different social media platforms about topics like sentiment, social tensions, and migration intentions. They are also working to characterize social media users and account for differences from the general population in order to use these sources to produce new real-time indicators.
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE (cscpconf)
The anonymity of social networks makes them attractive to those spreading hate speech, who mask their criminal activities online, posing a challenge to the world and in particular to Ethiopia. With the ever-increasing volume of social media data, hate speech identification becomes a challenge in aggravating conflict between citizens of nations. The high rate of production makes it difficult to collect, store and analyze such big data using traditional detection methods. This paper proposes the application of Apache Spark to hate speech detection to reduce these challenges. The authors developed an Apache Spark based model to classify Amharic Facebook posts and comments into hate and not hate, employing Random Forest and Naïve Bayes for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best with 79.83% accuracy. The proposed method achieves promising results using the unique big-data features of Spark.
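As a rough illustration of the TF-IDF feature weighting the paper mentions (the study itself used Apache Spark), a minimal standard-library version might look like this; the sample documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF: term frequency scaled by inverse document frequency.
    `docs` is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / c) for t, c in df.items()}
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return out

docs = [["hate", "speech", "post"], ["friendly", "post"], ["hate", "comment"]]
weights = tf_idf(docs)
# "post" appears in 2 of 3 docs, so it gets a lower weight than the rarer "speech"
```

These per-document weight vectors are the kind of feature representation that a Random Forest or Naïve Bayes classifier would then be trained on.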
Isi 2017 presentation on Big Data and bias (Piet J.H. Daas)
1) The document discusses three types of using big data in statistics: (1) combined with survey data, (2) from a single complete source, and (3) from a single incomplete source.
2) Examples of type 2 include road sensor traffic data and web-scraped price data. These sources completely cover their target populations.
3) Examples of type 3 include social media data and mobile phone data. Only part of the target population is included, so ways must be found to deal with the missing part, such as determining the characteristics of the included population.
Dealing with Information Overload When Using Social Media for Emergency Manag... (Mirjam-Mona)
Presentation of Starr Roxanne Hiltz and Linda P. Plotnick on the topic "Dealing with Information Overload When Using Social Media for Emergency Management: Emerging Solutions" at ISCRAM2013
Rapid identification of new drugs through online monitoring tools: The case o... (Australian Drug Foundation)
The rapid proliferation of new drugs available to Australians has necessitated the use of innovative techniques to monitor their emergence. In this presentation, Monica uses the example of NBOMe drugs (reportedly sold as ‘legal LSD’) to outline four ways of monitoring drug use trends online and in real time: Google Trends, drug user forums, Twitter, and Silk Road. These tools are freely available for use by clinicians, AOD workers and researchers who are seeking further information about new drugs presented by clients, or that are talked about in their work.
On How the Darknet and its Access to SCADA is a Threat to National Critical I... (Matthew Kurnava)
This document analyzes how the darknet poses a threat to national critical infrastructure. It begins with an introduction that defines the darknet and describes some of the illegal activities that occur there. The research question asks how the darknet threatens critical infrastructure and how vulnerable different sectors are. The hypothesis is that the darknet poses a primary threat to US cyber critical infrastructure due to criminal, hacktivist, and terrorist use that could significantly damage health and welfare. A literature review discusses research on darknet cyber attacks, hacktivist and terrorist groups using the darknet, and critical infrastructure's growing dependency on technology and vulnerability. The methodology will use an analytical approach to examine threats to each of the 16 US critical infrastructure sectors.
Recently, fake news has caused many problems for our society, and many researchers have been working on identifying it. Most fake news detection systems rely on linguistic features of the news; however, they have difficulty detecting highly ambiguous fake news, which can be identified only after considering its meaning and the latest related information. In this paper, to resolve this problem, we present a new Korean fake news detection system using a fact DB that is built and updated by direct human judgement after collecting obvious facts. Our system receives a proposition and searches the semantically related articles in the fact DB in order to verify whether the given proposition is true by comparing it with those related articles. To achieve this, we utilize a deep learning model, Bidirectional Multi-Perspective Matching for Natural Language Sentences (BiMPM), which has demonstrated good performance on the sentence matching task. However, BiMPM has some limitations: the longer the input sentence, the lower its performance, and it has difficulty making an accurate judgement when an unlearned word or an unlearned relation between words appears. To overcome these limitations, we propose a new matching technique that exploits article abstraction as well as an entity matching set in addition to BiMPM. In our experiments, we show that our system improves overall fake news detection performance.
Prasanth. K | Praveen. N | Vijay. S | Auxilia Osvin Nancy. V, "Fake News Detection using Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30014.pdf
Paper Url : https://www.ijtsrd.com/engineering/information-technology/30014/fake-news-detection-using-machine-learning/prasanth-k
The document outlines key aspects of research methodology including:
1. The objectives of research such as defining problems, formulating hypotheses, collecting and evaluating data, making deductions, and testing conclusions.
2. The different types of research including descriptive, applied, quantitative, conceptual, empirical, qualitative, fundamental, and analytical research.
3. The methods of collecting data including primary methods like questionnaires, observations, interviews, and schedules and secondary methods of collecting published and unpublished data from various sources.
Civic Exchange - 2009 The Air We Breathe Conference - Application of Studies ... (Civic Exchange)
Civic Exchange 2009 The Air We Breathe Conference - Experts Symposium 9 January 2009
Application of Studies on Health Effects of Air Pollution in Hong Kong
presented by Dr CM Wong (Department of Community Medicine and School of Public Health, The University of Hong Kong)
http://air.dialogue.org.hk
This document discusses different types of pollution including air, water, noise, land, and radioactive pollution. It defines each type of pollution, discusses their causes and effects, and provides suggestions for prevention. The document shares information through text and images to raise awareness about various forms of pollution and their impacts on the environment.
This document summarizes a mini project on air and noise pollution in and around industrial areas. It describes surveys conducted at busy traffic places and factories to study pollution levels. The effects include health issues such as lung diseases from air pollution, while noise pollution can cause hearing impairment, lack of concentration, heart attacks, and mental illness. Suggested strategies to mitigate noise pollution include using noise barriers, limiting vehicle speeds, and controlling traffic flow.
Air Pollution, Asthma, Triggers & Health - Research and Remediation Strategies (Sean McCormick)
This content was created to help provide health care practitioners with more detailed information about air pollution, it's impact on health, and low-no-cost strategies for reducing exposure to asthma triggers.
Air, water, and land pollution were discussed. Air pollution comes from natural sources like volcanoes and human sources such as factories and cars. Major air pollutants include carbon monoxide, sulfur dioxide, and particulate matter which can cause health issues. Water pollution comes from point sources like factories and non-point sources like agricultural runoff. Land pollution is caused by construction, agriculture, and domestic and industrial waste. Pollution has consequences like acid rain, smog, and damage to plants and wildlife. Reducing pollution requires efforts from individuals, industries, and governments.
Academic Stress For Management Students (Latha setna)
This document discusses stress among MBA students in India. It identifies several stressors for MBA students, including academic pressures, faculty behavior, relationships with other students, and placement concerns. A study was conducted of 100 MBA students which found that placement issues were the biggest stressor, followed by faculty, other students, and management. The document provides tips for students to better manage their stress through routines, breaks, relaxation, and time management.
This document summarizes a study on the impact of academic stress on MBA students of Gujarat Technological University. The study aimed to identify components of academic stress, including curriculum/instruction, teamwork, assessments, and placement. It surveyed 118 MBA students across Gujarat. The results showed that curriculum/instruction and lack of recreational time highly impacted stress levels. Behavioral stressors like cultural effects also impacted performance. Common outcomes of stress included headaches, sleep issues, nervousness and mood changes. The study provides insight into the sources and effects of academic stress on MBA students.
This document presents an overview of air pollution monitoring using remote sensing and GIS technologies. It discusses how satellite remote sensing can provide synoptic views of large areas and monitor multiple pollutants simultaneously. It also describes some common air pollutants and sources. Two case studies are then presented on using these methods to map ambient air pollution zones and monitor air quality in specific regions.
Dr. B. Victor presented on air pollution. He discussed different types of pollution sources and air pollutants. Some key effects of air pollution include damage to health, vegetation, and structures. Increased carbon dioxide contributes to global warming and climate change through the greenhouse effect. Air pollutants like sulfur dioxide and nitrogen oxides can cause acid rain when dissolved in water, harming aquatic life and soil.
This presentation discusses various types of air pollution including smog, acid rain, the greenhouse effect, and holes in the ozone layer. It notes that air pollution comes from both outdoor sources like vehicle emissions and indoor sources like smoking. The health effects of air pollution can be both short-term and long-term. Prevention efforts focus on reducing waste and changing lifestyles to be more environmentally friendly.
The document discusses air pollution, including its definition, types, causes, effects, and prevention. It defines air pollution as physical, chemical, and biological agents that modify the natural atmosphere. It discusses primary and secondary pollutants like carbon monoxide and ozone. Major causes of air pollution include vehicle emissions, industrial emissions, and natural sources like wildfires. Short-term effects include respiratory issues, while long-term effects involve chronic diseases like lung cancer and heart disease. Prevention strategies include controlling vehicle and industry emissions, restricting smoking, and increasing ventilation.
This document summarizes research methodology and design. It discusses types of research including pure and applied research as well as qualitative and quantitative research. It also outlines the research process including formulating research questions, developing a research proposal, and designing the research. The design considerations covered include design strategy, data collection methods, sampling, and pilot testing. It also discusses research ethics and characteristics of sound research.
The document discusses research methodology and defines research. It provides examples of what constitutes research and what does not. Research is defined as a systematic, logical process that includes understanding the problem, reviewing literature, collecting and analyzing data, drawing conclusions, and generalizing findings. The document also discusses types of research questions, purposes of research, and common challenges in conducting research.
Air pollution: its causes, effects and pollutants (Maliha Eesha)
This presentation gives the complete detail of air, air pollution, air pollutants and their types, each pollutant in detail and its causes and effects, acid rain, methods of prevention,smog,acidification,indoor pollution and so on. It is a complete package and I hope it'll be helpful in school! :)
A RESEARCH ON EFFECT OF STRESS AMONG KMPh STUDENTS (Natrah Abd Rahman)
Stress is the feeling that is created when we react to particular events. It can make you feel threatened or upset. It is a combination of psychological, physiological and behavioral reactions that people have in response to events that threaten or challenge them.
This document discusses different types of pollution including air, water, noise, land, and radioactive pollution. It provides definitions and overviews of each type of pollution, describes their causes and effects, and gives recommendations for prevention. The types of pollution covered are air pollution from industries and vehicles, water pollution from industrial and sewage waste, noise pollution from traffic, construction and airports, land pollution from mining, garbage and industrial waste, and radioactive pollution from nuclear power plants and waste. The document aims to educate about various forms of pollution and their impacts.
Pollution is the introduction of contaminants into the natural environment that cause adverse effects. It discusses various types of pollution like air, water, soil, noise, and light pollution. The document outlines causes like industries, vehicles, and agriculture. Effects include health impacts and ecosystem damage. It provides measures to control different types of pollution such as treating wastes, using public transport, and limiting fertilizers. The most polluted world cities include Cairo, Delhi, and Beijing. The conclusion is that reducing pollution requires going green.
This document defines air pollution as occurring when air contains harmful amounts of gases, dust, fumes or odors. It discusses outdoor sources like smog and indoor sources like burning wood. Natural sources include wildfires and volcanoes, while human sources are things like vehicles, power plants, and burning wood. Air pollution can cause health issues for humans and environmental effects like acid rain. The document recommends mitigating air pollution through sustainable development, international agreements, and new technologies.
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U... (Paolo Missier)
talk for paper published at ICWE2019:
Primo F, Missier P, Romanovsky A, Mickael F, Cacho N. A customisable pipeline for continuously harvesting socially-minded Twitter users. In: Procs. ICWE’19. Daedjeon, Korea; 2019.
IRJET- Sentiment Analysis using Machine Learning (IRJET Journal)
This document discusses sentiment analysis of social media posts using machine learning. The authors aim to classify social media posts as having either a political or non-political sentiment. A dictionary of keywords and their sentiments is created to analyze posts. Users can make posts that are then classified, and an admin can hide or delete posts containing harmful keywords. Graphs are generated to analyze the classification of posts as political versus non-political and by sentiment. The accuracy of classification depends on the training data and dictionary.
In the age of social media communication, it is easy to manipulate the minds of users and, in some cases, even instigate violent actions. There is a need for a system that can analyze the threat level of tweets from influential users and rank their Twitter handles, so that dangerous tweets can be kept from going public before fact-checking; such tweets can hurt people's sentiments and take the shape of violence. The study aims to analyse and rank Twitter users according to their influential power and the extremism of their tweets, to help prevent major protests and violent events. We scraped top trending topics and fetched tweets using those hashtags. We propose a custom ranking algorithm that considers source-based and content-based features, along with a knowledge graph, to generate a score and rank Twitter users accordingly. Our aim with this study is to identify and rank extremist Twitter users with regard to their impact and influence, using a technique that takes into consideration both source-based and content-based features of tweets to rank the extremist Twitter users with a high impact factor.
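The combination of source-based and content-based features described here can be sketched as a weighted score. The weights, feature choices, and sample data below are assumptions for illustration, not the study's actual algorithm.

```python
def rank_users(users, flagged_terms, w_source=0.4, w_content=0.6):
    """Illustrative ranking: a source score from account reach plus a content
    score from the fraction of tweets containing flagged terms.
    Weights and features are assumptions, not the paper's exact method."""
    scored = []
    for u in users:
        source = min(u["followers"] / 10_000, 1.0)  # normalised reach, capped at 1
        hits = sum(any(t in tw.lower() for t in flagged_terms) for tw in u["tweets"])
        content = hits / len(u["tweets"]) if u["tweets"] else 0.0
        scored.append((w_source * source + w_content * content, u["handle"]))
    return sorted(scored, reverse=True)

users = [
    {"handle": "@calm", "followers": 50_000, "tweets": ["nice weather today"]},
    {"handle": "@loud", "followers": 2_000, "tweets": ["burn it all down", "riot now"]},
]
ranking = rank_users(users, flagged_terms=["riot", "burn"])
# @loud outranks @calm: low reach, but every tweet matches a flagged term
```

A knowledge graph, as in the study, would replace the flat `flagged_terms` list with scores derived from relations between entities mentioned in the tweets.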
This project analyses the relations on Twitter between politicians and journalists in the triangle of political communication in a hybrid media system (Chadwick, 2013).
From Research to Applications: What Can We Extract with Social Media Sensing? (Yiannis Kompatsiaris)
SIGMAP22 Keynote Presentation:
Social media have transformed the Web into an interactive sharing platform where users upload data and media, comment on, and share this content within their social circles. The large-scale availability of user-generated content in social media platforms has opened up new possibilities for studying and understanding real-world phenomena, trends and events. Social media and websites provide an access to public opinions on certain aspects and therefore play an important role in getting insights on targeted audiences. The objective of this talk is to provide an overview of social media mining, including key aspects such as data collection, multimodal analysis and visualization. Challenges, such as fighting misinformation, will be presented together with applications, results and demonstrations from multiple areas including: news, environment, security, interior and urban design.
DigiCCurr 2013 PhD Workshop - Citizen Science and Data Curation: Who needs what? (Todd Suomela)
Todd Suomela's dissertation will examine issues related to digital curation of data from citizen science projects. It will focus on identifying stakeholders in citizen science, understanding their data curation needs and awareness, and how information scientists can help meet those needs. The study overlays potential citizen science stakeholders on the DataONE data lifecycle model to explore their roles and concerns at different stages of data management.
Researching Social Media – Big Data and Social Media Analysis (Farida Vis)
Researching Social Media – Big Data and Social Media Analysis, presentation for the Social Media for Researchers: A Sheffield Universities Social Media Symposium, 23 September 2014
Critically Assembling Data, Processes & Things: Toward an Open Smart City (Cottbus Brandenburg University of Technology, Lecture Series on Smart Regions, June 5, 2018)
This lecture will critically focus on smart cities from a data based socio-technological assemblage approach. It is a theoretical and methodological framework that allows for an empirical examination of how smart cities are socially and technically constructed, and to study them as discursive regimes and as a large technological infrastructural systems.
The lecture will refer to the research outcomes of the ERC funded Programmable City Project led by Rob Kitchin at Maynooth University and will feature examples of empirical research conducted in Dublin and other Irish cities.
In addition, the lecture will discuss the research outcomes of the Canadian Open Smart Cities project funded by the Government of Canada GeoConnections Program. Examples will be drawn from five case studies namely about the cities of Edmonton, Guelph, Ottawa and Montreal, and the Ontario Smart Grid as well as number of international best practices. The recent Infrastructure Canada Canadian Smart City Challenge and the controversial Sidewalk Lab Waterfront Toronto project will also be discussed.
It will be argued that no two smart cities are alike, although technological solutionist and networked urbanist approaches dominate, and it is suggested that these kinds of smart cities may not live up to the promise of being better places to live.
In this lecture, the ideals of an Open Smart City are offered instead and in this kind of city residents, civil society, academics, and the private sector collaborate with public officials to mobilize data and technologies when warranted in an ethical, accountable and transparent way in order to govern the city as a fair, viable and livable commons that balances economic development, social progress and environmental responsibility. Although an Open Smart City does not yet exist, it will be argued that it is possible.
IRJET- Identification of Prevalent News from Twitter and Traditional Media us... (IRJET Journal)
This document describes a study that uses community detection models to identify prevalent news topics discussed on both Twitter and traditional media like BBC. It collects tweets and news articles about sports over a one-month period. Keywords are extracted from the data and a graph is constructed to represent relationships between words. Three community detection models - Girvan-Newman clustering, CLIQUE, and Louvain - are used to cluster similar content and detect communities of keywords representing news topics. The number of unique Twitter users engaged with each topic is also calculated to rank topics by user attention. The goal is to analyze how information is distributed between social and traditional media and identify emerging topics with low coverage in traditional sources.
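The keyword graph this entry describes can be built from word co-occurrence within posts. The sketch below uses connected components as a much simpler stand-in for the Girvan-Newman, CLIQUE, and Louvain models named in the study; the sample posts are invented.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(texts):
    """Add an edge between two keywords whenever they appear in the same post."""
    adj = defaultdict(set)
    for text in texts:
        words = set(text.lower().split())
        for a, b in combinations(sorted(words), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def components(adj):
    """Connected components: a crude stand-in for community detection."""
    seen, comms = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comm = [start], set()
        while stack:
            node = stack.pop()
            if node in comm:
                continue
            comm.add(node)
            stack.extend(adj[node] - comm)
        seen |= comm
        comms.append(comm)
    return comms

posts = ["liverpool wins derby", "derby goal liverpool", "tennis final serve"]
communities = components(cooccurrence_graph(posts))
# two communities: the football keywords and the tennis keywords
```

Proper community detection additionally splits densely connected components into sub-communities, which is what the three models in the study do.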
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF... (IJCSEA Journal)
International Journal of Computer Science, Engineering and Applications (IJCSEA) (IJCSEA Journal)
International Journal of Computer Science, Engineering and Applications (IJCSEA) is an open-access, peer-reviewed journal that publishes articles contributing new results in all areas of computer science, engineering and applications. The journal is devoted to the publication of high-quality papers on theoretical and practical aspects of these fields.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...IJCSEA Journal
In this digital era, social media is an important tool for information dissemination, and Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly interpreted, provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm for discovering hidden patterns in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to coronavirus discussions. The study compares regular LDA with LDA based on collapsed Gibbs sampling (LDAMallet), using different data processing steps: with and without trigrams, and with and without hashtags. It provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme based on hashtags improves topic inference, yielding a better coherence score.
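The hashtag-pooling scheme the abstract credits with better coherence can be illustrated with a short sketch: tweets that share a hashtag are merged into one pseudo-document before topic modeling, giving LDA longer documents to estimate topics from. This is a minimal, hypothetical illustration (the tokenization is deliberately naive); the pooled token lists would then be fed to any LDA implementation, such as gensim or MALLET.

```python
from collections import defaultdict

def pool_tweets_by_hashtag(tweets):
    """Merge tweets that share a hashtag into one pseudo-document.

    Tweets without any hashtag stay as individual documents, matching
    the un-pooled baseline. Returns token lists ready for LDA.
    """
    pools = defaultdict(list)  # hashtag -> accumulated word tokens
    unpooled = []
    for text in tweets:
        tokens = text.lower().split()
        tags = [t for t in tokens if t.startswith("#")]
        words = [t for t in tokens if not t.startswith("#")]
        if tags:
            for tag in tags:
                pools[tag].extend(words)
        else:
            unpooled.append(words)
    return list(pools.values()) + unpooled

docs = pool_tweets_by_hashtag([
    "#covid19 vaccine rollout starts today",
    "#covid19 new cases reported in the city",
    "nice weather this afternoon",
])
# The two #covid19 tweets collapse into one longer pseudo-document.
```

Un-pooled tweets are often too short for LDA to infer coherent topics; pooling trades document granularity for richer word co-occurrence statistics.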
This document summarizes several research papers that used social network analysis on Twitter data related to COVID-19. The papers analyzed hashtags, retweets, mentions and conversations to understand public debates and information spread about topics like conspiracy theories, medical news, and public responses in different countries. The studies identified influential users, common discussions, and how social media could provide insights into managing pandemic situations.
Information Contagion through Social Media: Towards a Realistic Model of the ...Axel Bruns
Paper by Axel Bruns, Patrik Wikström, Peta Mitchell, Brenda Moon, Felix Münch, Lucia Falzon, and Lucy Resnyansky presented at the ACSPRI 2016 conference, Sydney, 19-22 July 2016.
A framework for real time semantic social media analysis Zelia Blaga
This document presents a framework for real-time semantic analysis of social media content, specifically focused on UK political tweets. The framework collects tweets using the Twitter API, processes them using natural language processing tools to extract entities, topics, and sentiment, and indexes the output using GATE Mimir for semantic search and visualization. It was tested on over 1.8 million tweets related to the 2015 UK general election, extracting useful political insights on discussed topics, sentiment toward parties, and how engagement varied by issue. The framework enables complex queries over annotated text and correlates data to provide summaries and predictive analytics of social media discussions.
Combating propaganda texts using transfer learningIAESIJAI
Recently, it has been observed that people are shifting away from traditional news media sources towards trusting social networks to gather news information. Social networks have become the primary news source, although the validity and reliability of the information provided are uncertain. Memes are a crucial content type, very popular among young people, and play a vital role in social media; they spread quickly among people in a peer-to-peer manner rather than through prescribed channels. Unfortunately, promoters and propagandists have adopted memes to indirectly manipulate public opinion and influence attitudes using psychological and rhetorical techniques. This type of content can lead to unpleasant consequences in communities. This paper introduces an ensemble model system that addresses one of the most recent natural language processing research topics: detection of propaganda techniques in texts extracted from memes. The paper also explores state-of-the-art pre-trained language models. The proposed model uses different optimization techniques, such as data augmentation and model ensembling. It has been evaluated using the reference dataset from SemEval-2021 Task 6. Our system outperforms the baseline and state-of-the-art results, achieving an F1-micro score of 0.604 on the test set.
Use of ICT in Research Writing: Tools and Technologyssuser1310d0
This document outlines the role of information and communication technologies (ICT) in research. It discusses ICT factors affecting research, challenges, and skills needed by researchers. The seminar aims to help researchers understand how to locate, access, and evaluate information, and identify relevant tools to support research work. ICT consists of hardware, software, networks, and media that can change the research process. Researchers must develop skills to manage materials, analyze data, communicate, write academically, and disseminate knowledge using ICT tools.
This document provides a review of techniques, tools, and platforms for analyzing social media data. It discusses the types of social media data and formats available, as well as tools for accessing, cleaning, analyzing, and visualizing social media data. Some key challenges of social media research are the restricted access to comprehensive data sources, lack of tools for in-depth analysis without programming, and need for large data storage and computing facilities to support research at scale. The document provides a methodology and critique of current approaches and outlines requirements to better support social media research.
understanding the pandemic through mining covid news using natural language p...Kishor Datta Gupta
This document summarizes a research presentation on analyzing Covid-19 news reports from newspapers in developed and developing countries using natural language processing. It introduces the research aim to understand how newspapers portray the pandemic using NLP techniques on reports from the US and Bangladesh. The researchers collected over 1000 news articles to create the NNK Dataset, which they preprocessed and analyzed to extract keywords, sentiments, and case numbers. Word clouds of frequent terms and numeric extractions showed how coverage evolved over time. The dataset was made publicly available to encourage further analysis of portraying pandemics through newspapers.
Geographic knowledge discovery (PhD Theme) by Roberto Zagal
1. Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City
Roberto Zagal, Instituto Politecnico Nacional, ESCOM-IPN
Felix Mata, Instituto Politecnico Nacional, UPIITA-IPN
Christophe Claramunt, Naval Academy Research Institute
5. Introduction (4)
What about inconsistency?
Id | Type | Description
1 | Tweet, newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX #bad #new
6. Related work
• The social data problem has been addressed through:
1. KDD and social mining
2. Formal publications (news media) guiding the classification of the interests of social media users [1]
3. Opinion mining and topic modeling [2]
• But not using GKD with an approach of crossing data layers
7. Goal
Know how to discover the certainty level of information by crossing geographic and social information.
9. Data extraction: Sample tweet (Phase 1)
Id | Type | Description
1 | Tweet, newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX #bad #news
We consider tweets from accounts that periodically report data on air pollution.
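The data-cleaning pass that closes this phase (tokenization, stop-word removal, stemming, per the speaker notes) can be sketched roughly as follows; the stop-word list and suffix rules are toy stand-ins, not the ones used in the work.

```python
import re

STOP_WORDS = {"the", "of", "is", "in", "a", "an", "and"}  # toy list

def clean_tweet(text):
    """Tokenize, drop stop words, and crudely stem a tweet.

    Hashtags and mentions are kept intact so later phases (domain
    detection, location extraction) can still use them.
    """
    tokens = re.findall(r"[#@]?\w+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        if not t.startswith(("#", "@")):
            # Strip one common suffix, longest first, as a crude stemmer.
            for suffix in ("ation", "ing", "es", "s"):
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    t = t[: -len(suffix)]
                    break
        stemmed.append(t)
    return stemmed

print(clean_tweet("The index of IMECAS is 135 #CDMX"))
# -> ['index', 'imeca', '135', '#cdmx']
```

In practice a library stemmer (e.g. Snowball) would replace the suffix loop; the point here is only the shape of the cleaning step.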
10. Data extraction: Domain Detection (Phase 1)
Id | Type | Description
2 | Tweet, Newspaper2 | @ #contamination air is 127 IMECAS #CDMX #bad #new
The post is related to a pollution topic.
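A rough sketch of how this synonym-and-subclass matching could work; the SYNONYMS table and PARENT hierarchy below are invented stand-ins for the pollution ontology the deck refers to.

```python
# Toy stand-in for the pollution ontology: each class lists the surface
# terms (synonyms included) that evidence it; PARENT encodes the class
# hierarchy used to generalize a match upward.
SYNONYMS = {
    "pollution": {"pollution", "contamination", "#contamination", "smog"},
    "IMECAS": {"imecas", "imeca"},
}
PARENT = {"IMECAS": "IndexOfAirQuality", "IndexOfAirQuality": "pollution"}

def detect_domain(tokens):
    """Return the set of ontology classes a tweet's tokens match."""
    matched = set()
    for cls, terms in SYNONYMS.items():
        if any(t.lower() in terms for t in tokens):
            matched.add(cls)
            # A match on a subclass also evidences its ancestors,
            # e.g. IMECAS -> IndexOfAirQuality -> pollution.
            p = cls
            while p in PARENT:
                p = PARENT[p]
                matched.add(p)
    return matched

tweet = "@ #contamination air is 127 IMECAS #CDMX #bad #new".split()
print(detect_domain(tweet))  # the tweet is evidenced as pollution-related
```

Matching "contamination" to the "pollution" class by synonymy and "IMECAS" to a subclass of "IndexOfAirQuality" is exactly the walk the slide describes; a real system would back this with the full ontology rather than two hand-written dictionaries.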
11. Preprocessing (Phase 2)
• Emotion detection [3]
• Location extraction
Id | Type | Description
2 | Tweet, Newspaper2 | @ #contamination air is 127 IMECAS #CDMX #bad #new
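Phase 2 could be sketched along these lines; the polarity word lists and the place gazetteer are hypothetical, and real emotion detection (reference [3]) and location extraction would be considerably richer.

```python
# Toy polarity lexicons and place gazetteer; a real Phase 2 would use
# the emotion-detection method of reference [3] and proper geocoding.
NEGATIVE = {"#bad", "bad", ":(", "#alert"}
POSITIVE = {"#good", "good", ":)"}
KNOWN_PLACES = {"#cdmx", "#taxqueña", "#indiosverdes"}

def preprocess(tokens):
    """Attach a polarity label and location candidates to a tweet."""
    score = sum((t.lower() in POSITIVE) - (t.lower() in NEGATIVE) for t in tokens)
    polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    places = [t for t in tokens if t.lower() in KNOWN_PLACES]
    return {"polarity": polarity, "locations": places}

result = preprocess("@ #contamination air is 127 IMECAS #CDMX #bad #new".split())
# result["polarity"] is "negative" (driven by #bad); "#CDMX" is a location candidate.
```

When the location list comes back empty, only the publication time is carried into the later phases, as the speaker notes point out.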
12. Classification: C5 algorithm (Phase 3)
• If we detect to which category each set of data belongs (e.g., Health and Pollution, Transport and Pollution), then we can select which data sources should be crossed with the tweet in order to discover knowledge.
Id | Description | Category
2 | @ #contamination air is 127 IMECAS #CDMX #bad #new | Health and pollution
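C5.0 is a learned decision-tree algorithm (commonly run through its R package); as a stand-in, the sketch below hand-writes the kind of rule tree that Phase 3 might learn over bag-of-words features. The feature words and leaf categories are illustrative, not the trained model.

```python
# Hand-written stand-in for a decision tree that C5.0 might induce from
# labeled tweets in Phase 3; every feature word below is illustrative.
def classify_category(tokens):
    """Assign a pollution-related tweet to a more specific category."""
    has = {t.lower().lstrip("#@") for t in tokens}
    if has & {"imecas", "imeca", "contamination", "air"}:
        if has & {"health", "hospital", "respiratory", "bad"}:
            return "Health and pollution"
        if has & {"traffic", "metro", "transport"}:
            return "Transport and pollution"
        return "Health and pollution"  # default leaf for air-quality reports
    return "Pollution (generic)"

tweet = "@ #contamination air is 127 IMECAS #CDMX #bad #new".split()
print(classify_category(tweet))  # Health and pollution, as on the slide
```

The category returned here is what Phase 4 uses to pick which official data sources to cross with the tweet.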
13. Crossing data (Phase 4)
• Example 1: inconsistencies in tweets 1 and 2?
Id | Type | Description
1 | Tweet, Newspaper1 | The index of IMECAS is 135 #CDMX
2 | Tweet, Newspaper2 | @ the #contamination of air is 127 IMECAS #CDMX
What is correct?
14. Crossing data (Phase 4): How do we know which tweet is correct?
Answer:
It was classified in the domain of Health and pollution (in Phase 3).
Then the official data from health reports and pollution reports are selected to be crossed with the tweet (in Phase 4).
28/10/16
15. Crossing data (Phase 4)
• Data are crossed considering different attributes; from the tweet, the date and hour of publication are taken.
• When crossed with the date and hour from the official reports of air quality, a match is found.
16. Crossing data (Phase 4)
We discovered that the tweets are correct but with different locations (the location is not included in the original tweet).
Id | Type | Description
1 | Tweet, newspaper1 | The index of IMECAS is 135 #CDMX #Taxqueña, 10:00 hours
2 | Tweet, Newspaper2 | The #contaminación of air is 127 IMECAS #CDMX #Indios Verdes, 15:00 hours
Knowledge Discovered!
17. Other preliminary results
• Following the same approach
• Knowledge discovered: which topics are talked about in each region
Topic | Geographic area | Period
Health | South, West | March-June
Transport | North, East | January-December
Policy and programs | Center | January-December
Pollution | Surrounding Mexico City | January-June
Public roads | Surrounding Mexico City | January-December
18. Conclusions and future work
• The integration of the geographical and temporal dimensions allows us to discover data correlations; this knowledge can increase the certainty of some information in social networks.
• The main contribution is that domain discovery and classification of information is a key element of new approaches for discovering geographic information.
19. Conclusions and future work
• Future work:
• Use of clustering or deep learning approaches to improve the classification process
• Location detection is a hard problem; other machine learning methods for social media can be tested [4, 5]
• How can we improve geographic knowledge discovery when there are no explicit links between traditional data sources and social sources?
21. References
[1] Jonghyun Han, Hyunju Lee. Characterizing the interests of social media users: Refinement of a topic model for incorporating heterogeneous media. Information Sciences, Volumes 358-359, 1 September 2016, Pages 112-128, ISSN 0020-0255.
[2] Schubert, E., Weiler, M., & Kriegel, H. P. (2014, August). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 871-880). ACM.
[3] Architecture for analysis of feelings in Facebook with semantic approach (in Spanish), pp. 59-69; rec. 2014-06-22; acc. 2014-07-21. Research in Computing Science 75 (2014). http://www.rcs.cic.ipn.mx/rcs/2014_75/
[4] Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. How events unfold: spatiotemporal mining in social media. SIGSPATIAL Special 7, 3 (January 2016), 19-25. DOI: http://dx.doi.org/10.1145/2876480.2876485
[5] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851-860. ACM, 2010.
Editor's Notes
SLIDE 1:
1.- Good morning.
2.- My name is Roberto. I'm a PhD student at the National Polytechnic Institute in Mexico City.
3.- Thanks for the invitation to be here today.
4.- I'm talking about "Geographical Knowledge Discovery applied to the Social Perception of Pollution in Mexico City".
5.- This research is advised by Dr. Felix Mata and Dr. Christophe Claramunt.
7.- In recent years, air pollution in Mexico City has increased considerably.
8.- Air pollution is a problem that requires analysis across multiple domains of knowledge, because we now have more information spread over increasingly complex data sources.
SLIDE 2:
Currently, social networks are becoming increasingly relevant as a means of diffusion and sharing of citizen views.
In order to discover new knowledge on air pollution, we need to consider data from different sources, such as government, social groups, social media, and other web data.
In social media, people make comments and observations that might reflect important views on different topics related to air pollution.
SLIDE 3:
1.- We reviewed three representative and heterogeneous data sources:
2.- The Government of Mexico City, because it generates information in traditional databases about pollution. This information is trustworthy.
3.- News media, an important element because it provides a valuable source for deriving on-the-fly citizen opinions.
4.- For example, people in social networks express complaints, opinions, reports of problems, and observations regarding air pollution.
5.- We consider social networks an instantaneous picture of the social perception of air pollution.
6.- Now, the question is: how can we cross this information to discover new, trustworthy knowledge about pollution?
SLIDE 4:
1. Information produced by institutions has a degree of certainty and veracity; it is assumed to be true.
2. But.
3. Can all information produced in social networks be trusted?
4.- What is the level of certainty of the information produced in social networks relative to other sources?
5.- This is the problem statement of this preliminary investigation.
SLIDE 5:
1. The information sometimes needs to be verified to know whether it is correct or not.
2. For example:
3. We have an inconsistency in the following two tweets about air quality.
4. IMECAS is the acronym of the Metropolitan Index of Air Quality in Mexico City.
5. In tweet 1, the newspaper reports that the IMECAS index is one hundred thirty-five (135).
6. In tweet 2, the newspaper reports that the IMECAS index is one hundred twenty-seven (127).
7. Which one has the correct information?
8. How can we detect and resolve the inconsistency in the information?
SLIDE 6:
1.- These papers do not have an explicit relation with the geographic dimension.
2.- And they do not explore the certainty of information.
SLIDE 7:
1. This means that we can discover the level of certainty of the publications that appear in social media
2. by crossing these data with additional formal sources.
4. The geographic information can be used as a link between different data sources.
SLIDE 8:
1.- We propose a GKD framework for air pollution that includes four phases:
2.- Data extraction: oriented to getting information from social sources and newspapers.
3.- The preprocessing phase: includes location and sentiment detection.
4.- The classification phase categorizes the data into specific topics.
5.- Crossing data: helps to detect the level of information certainty.
SLIDE 9:
1.- For extraction, we consider tweets from accounts that periodically report data on air pollution, for example digital newspapers of Mexico.
2.- Extraction continues using initial key phrases and hashtags, like #CDMX or #AirPollution.
4.- Afterwards, data cleaning is performed; it includes tokenization, removal of stop words, and stemming.
SLIDE 10:
1. Domain detection semantically pre-classifies tweets into a pollution category, for example:
2. In tweet 2, the term "contamination" matches the "pollution" class by synonymy.
3. Next, the word IMECAS matches the class "IMECAS", which is a subclass of "IndexOfAirQuality".
4. We can say that the post is related to a pollution topic, which is a generic class.
5. It is possible that the tweet belongs to a more specific category that describes the nature of the post.
SLIDE 11:
1. In this part, we detect whether the post expresses a positive or negative feeling through words or emoticons. This detection is useful for identifying trends in the social perception of a specific pollution topic, for example tweets positive toward politics and pollution.
2. Regarding the location of the tweet, we assume that each tweet contains metadata about its place and time of publication.
3. Sometimes a tweet does not contain explicit or implicit information that allows defining its location. In this case, only the time of publication is considered in the following phases.
SLIDE 12:
1. If we detect to which specific category each set of data belongs, we can select the data sources that should be crossed with the tweet, in order to discover new knowledge and certainty.
2. Tweet 2 is classified into a more specific category: health and pollution.
3. We chose C5 because it is one of the algorithms that has shown good performance in knowledge discovery in databases.
Slide 13:
At this stage, quantitative values and qualitative values are separated.
1) Using the ontology, we can identify and separate terms like IMECAS, Air, and Pollution.
2) The numerical IMECA value is separated out.
3) Now, we know that this value must be in the range from 0 to 201, according to the definition of the IMECA index. If this holds, we can say that we have found a valid air quality value.
4) It is possible that this approach does not work in some cases.
5) These tweets do not contain information about their location, but we consider the time of publication.
6) Using the IMECA value and the time of the tweet, we proceed to search for matches in government data sources on air quality.
Slide 14:
1. Through the categorization of the tweet, we know that we can cross information with the air quality database, because the tweet is related to pollution and public health topics.
SLIDE 15:
1.- The air quality data is provided by the environmental monitoring ministry of the CDMX government.
SLIDE 16:
1. The tweet has no location, but using its time component,
2. we find a match in the official data using the IMECA value.
3. Then, the official data helps us to discover the tweet's location.
SLIDE 17:
1. In these additional results, we can see the classification of tweets by topic and location.
2. These results show the trend of social perception in certain subjects and geographical areas.
Slide 18:
1.- The integration of dimensions.
2.- The domain discovery.