Social media platforms let users share their opinions through textual or multimedia content. In many settings, this content becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to have social media monitored as a source for understanding the stance, opinion, and sentiment of their customers, audience, and potential audience. This is crucial for them because it lets them understand trends and future commercial and marketing opportunities.
However, all this relies on a solid and reliable data collection phase that guarantees that all the analyses, extractions, and predictions are applied to clean, solid, and focused data. Indeed, topic-based collection of social media content performed through keyword-based search typically yields very noisy results.
We recently carried out a simple study aimed at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e., classification, applied to data collected from social media platforms through keywords regarding a specific topic. We define a general method and then validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
For this case study, we collaborated with domain experts to label the dataset, and we then evaluated and compared the performance of classifiers trained with different feature extraction strategies.
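The cleaning step described above can be prototyped with a simple supervised classifier. The sketch below is illustrative only: the texts, labels, and keyword are invented, and the study does not prescribe this particular classifier. It implements a multinomial Naive Bayes with Laplace smoothing, from scratch, to separate on-topic from off-topic posts collected via keyword search.

```python
# Minimal sketch (not the paper's exact pipeline): a multinomial Naive Bayes
# classifier with Laplace smoothing that separates on-topic from off-topic
# posts collected via keyword search. All example texts and labels are invented.
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (token_list, label). Returns model parameters."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n in class_counts.items():
        lp = math.log(n / total)  # log class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy data: tweets collected with the keyword "scala" (the theater vs. the
# programming language), mimicking the noise of keyword-based collection.
train = [
    ("opera tonight at teatro alla scala milano".split(), "on-topic"),
    ("beautiful ballet performance at la scala".split(), "on-topic"),
    ("learning functional programming in scala today".split(), "off-topic"),
    ("scala compiler error help needed".split(), "off-topic"),
]
model = train_nb(train)
print(predict_nb(model, "ballet at teatro alla scala".split()))
```

In the actual study, richer feature extraction strategies were compared; this sketch only shows the shape of the classification step.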
Iterative knowledge extraction from social networks. The Web Conference 2018, by Marco Brambilla
Knowledge in the world continuously evolves, and ontologies are largely incomplete, especially regarding data belonging to the so-called long tail. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds.
In this paper we address the following research questions: (1) How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds? (2) How does the reconstructed domain knowledge spread geographically? (3) Can the method be used to inspect the past, present, and future of knowledge? (4) Can the method be used to find emerging knowledge?
This work was presented at The Web Conference 2018, MSM workshop.
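The candidate-ranking step above (term-based feature vectors, centroid of the seeds, distance-based ranking) can be sketched as follows. The entity names and term lists are invented, and cosine similarity stands in for whatever distance measure the method actually uses.

```python
# Hedged sketch of seed-centroid ranking: candidates are represented as
# term-frequency vectors over terms from their social content, and ranked
# by cosine similarity to the centroid of the seed vectors.
import math
from collections import Counter

def tf_vector(terms, vocab):
    c = Counter(terms)
    return [c[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(seed_docs, candidate_docs, top_k=2):
    all_docs = list(seed_docs.values()) + list(candidate_docs.values())
    vocab = sorted({t for doc in all_docs for t in doc})
    seed_vecs = [tf_vector(doc, vocab) for doc in seed_docs.values()]
    # Centroid of the seed vectors: component-wise mean.
    centroid = [sum(col) / len(seed_vecs) for col in zip(*seed_vecs)]
    scored = sorted(
        ((cosine(tf_vector(doc, vocab), centroid), name)
         for name, doc in candidate_docs.items()),
        reverse=True)
    return [name for _, name in scored[:top_k]]

# Invented seeds (prototype entities) and candidates, with their content terms.
seeds = {"EntityA": ["fashion", "style", "collection", "milan"],
         "EntityB": ["fashion", "design", "collection", "paris"]}
candidates = {"EntityC": ["fashion", "collection", "style", "newyork"],
              "OffTopic": ["match", "goal", "league", "stadium"]}
print(rank_candidates(seeds, candidates, top_k=1))
```

The top-ranked candidates can then be fed back as new seeds, which is exactly the iteration the paper investigates.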
In online social media platforms, users can express their ideas by posting original content or by adding comments and responses to existing posts, thus generating virtual discussions and conversations. Studying these conversations is essential for understanding the online communication behavior of users. This study proposes a novel approach to retrieving popular patterns in online conversations using network-based analysis. The analysis consists of two main stages: intent analysis and network generation. Users' intention is detected using keyword-based categorization of posts and comments, integrated with classification through Naïve Bayes and Support Vector Machine algorithms for uncategorized comments. A continuous human-in-the-loop approach further improves the keyword-based classification. To build and understand communication patterns among the users, we build conversation graphs starting from the hierarchical structure of posts and comments, using a directed multigraph network. The experiments categorize 90% of comments with 98% accuracy on a real social media dataset. The model then identifies relevant patterns in terms of shape and content, and finally determines the relevance and frequency of the patterns. Results show that the most popular online discussion patterns obtained from conversation graphs resemble real-life interactions and communication.
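A minimal sketch of the network-generation stage: starting from the post/comment hierarchy, a directed multigraph edge list is derived with edges from each comment's author to the author of the parent item. The field names and the toy thread below are assumptions for illustration, not the study's actual schema.

```python
# Illustrative sketch: derive a directed multigraph edge list from the
# hierarchical structure of posts and comments, then count recurring
# user-to-user interaction patterns.
from collections import Counter

def conversation_edges(items):
    """items: list of dicts with 'id', 'author', 'parent' (None for posts).
    Returns directed edges (comment author -> parent item's author);
    duplicates are kept, so the result is a multigraph edge list."""
    authors = {it["id"]: it["author"] for it in items}
    return [(it["author"], authors[it["parent"]])
            for it in items if it["parent"] is not None]

thread = [
    {"id": 1, "author": "alice", "parent": None},  # original post
    {"id": 2, "author": "bob",   "parent": 1},     # comment on the post
    {"id": 3, "author": "carol", "parent": 1},
    {"id": 4, "author": "alice", "parent": 2},     # reply by the poster
    {"id": 5, "author": "bob",   "parent": 4},     # back-and-forth
]
edges = conversation_edges(thread)
# Pattern frequency: which user-to-user interactions recur most often.
print(Counter(edges).most_common(2))
```

The shape analysis in the paper works on such graphs, identifying recurring conversation shapes rather than just single edges.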
Community analysis using graph representation learning on social networks, by Marco Brambilla
In a world that is more and more connected, new and complex interaction patterns can be extracted from the communication between people. This is extremely valuable for brands, which can better understand the interests of users and the trends on social media to better target their products. In this paper, we aim to analyze the communities that arise around commercial brands on social networks to understand the meaning of similarity, collaboration, and interaction among users. We exploit the network that builds around the brands by encoding it into a graph model. We build a social network graph, considering user nodes and friendship relations; then we compare it with a heterogeneous graph model, where posts and hashtags are also considered as nodes and connected to the different node types; finally, we build a reduced network, generated by inducing direct user-to-user connections through the intermediate nodes (posts and hashtags). These different variants are encoded using graph representation learning, which generates a numerical vector for each node. Machine learning techniques are applied to these vectors to extract valuable insights for each user and for the communities they belong to. In the paper, we report on our experiments performed on an emerging fashion brand on Instagram, and we show that our approach is able to discriminate potential customers for the brand and to highlight meaningful sub-communities composed of users that share the same kind of content on social networks.
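As an illustration of the final step described above, once graph representation learning has produced a numerical vector per node, ordinary clustering can expose sub-communities. The sketch below uses synthetic embeddings and a tiny deterministic k-means; the paper does not specify this particular clustering algorithm, so treat it as a stand-in.

```python
# Sketch: cluster per-node embedding vectors to find sub-communities.
# The embeddings are synthetic stand-ins for graph-representation-learning
# output; k-means here is an illustrative choice, not the paper's method.
import numpy as np

def kmeans(points, k, iters=10):
    # Deterministic init: spread initial centers across the point list.
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Two synthetic "communities" of user embeddings around different centers.
emb = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (5, 8)),
                 np.random.default_rng(2).normal(1.0, 0.1, (5, 8))])
labels = kmeans(emb, k=2)
print(labels)
```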
An introduction to the world of Social Network Analysis and a view on how it may help learning networks. History, data collection, and several analysis techniques are shown.
2010 Catalyst Conference - Trends in Social Network Analysis, by Marc Smith
Review of trends related to social network analysis in the enterprise. Presented at the 2010 Catalyst Conference in San Diego, CA, on July 29, 2010, together with Mike Gotta, Gartner Group.
Asymmetric Social Proximity Based Private Matching Protocols for Online Socia..., by 1crore projects
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear..., by Xiaohan Zeng
The advent of social networks has completely changed our daily life. The deluge of data collected on Social Network Services (SNS) and recent developments in complex network theory have enabled many marvelous predictive analyses, which tell us many amazing stories.
Why do we often feel that "the world is so small"? Is six degrees of separation pure imagination, or is it based on mathematical insight? Why are there just a few rock stars who enjoy extreme popularity while most of us stay unknown to the world? When science meets coffee-shop knowledge, things are bound to be intriguing.
I will first briefly describe what social networks are, in the mathematical sense. Then I will introduce some ways to extract characteristics of networks, and how these analyses can explain many anecdotes in our life. Finally, I'll show an example of what we can learn from social network analysis, based on data from Groupon.
Social Network Analysis Introduction including Data Structure Graph overview, by Doug Needham
Social Network Analysis Introduction including Data Structure Graph overview. Given in Cincinnati August 18th 2015 as part of the DataSeed Meetup group.
LAK13 Tutorial: Social Network Analysis 4 Learning Analytics, by goehnert
Slides of the tutorial "Computational Methods and Tools for Social Network Analysis Networked Learning Communities" at the LAK 2013 in Leuven.
Tutorial Homepage:
http://snatutoriallak2013.ku.de/index.php/SNA_tutorial_at_LAK_2013
Conference Homepage:
http://lakconference2013.wordpress.com/
Evolving social data mining and affective analysis, by Athena Vakali
Evolving social data mining and affective analysis methodologies, framework and applications - Web 2.0 facts and social data
Social associations and all kinds of graphs
Evolving social data mining
Emotion-aware social data analysis
Frameworks and Applications
Subscriber Churn Prediction Model using Social Network Analysis In Telecommun..., by BAINIDA
Subscriber Churn Prediction Model using Social Network Analysis in the Telecommunication Industry, by เชษฐพงศ์ ปัญญาชนกุล and Dr. อานนท์ ศักดิ์วรวิชญ์.
Presented at the First NIDA Business Analytics and Data Sciences Contest/Conference, organized by the School of Applied Statistics and Data Sciences Thailand.
I offer insights from my conversations with close to a dozen information and knowledge managers on how we can make the most strategic contribution to text and data mining projects. Presented at SLA 2021.
Information access can be limited in situations where traditional media outlets cannot cover events due to geographical limitations or censorship, for example during civil unrest, war, or natural disasters. In these situations, citizen journalism replaces or complements traditional media in the documentation of such events. Microblogging services such as Twitter have become of great use in these scenarios due to their mobile nature and multimedia capabilities.
In this research we propose a method to create searchable, semantically annotated news articles from tweets in an automated way using the cloud of linked open data.
Presentation for Harvard's ABCD Technology in Education group:
The Institute for Quantitative Social Science (IQSS) is a unique entity at Harvard - it combines research, software development, and specialized services to provide innovative solutions to research and scholarship problems at Harvard and beyond. I will talk about the software projects that IQSS is currently working on (Dataverse, Zelig, Consilience, and OpenScholar), including the research and development processes, the benefits provided to the Harvard community, and the impacts on research and scholarship.
Sentence embedding to improve rumour detection performance model, by IAESIJAI
Recently, most individuals have preferred accessing the most recent news via social media platforms like Twitter as their primary source of information. Moreover, Twitter enables users to post and distribute tweets quickly and unsupervised. As a result, Twitter has become a popular platform for disseminating false information, such as rumours. These rumours are then propagated as accurate and influence public opinion and decision-making. The issue arises when a decision or policy with substantial consequences is made based on rumours. To avoid the negative impacts of rumours, several researchers have attempted to detect them automatically as early as feasible. Previous studies employed supervised learning methods to identify Twitter rumours and relied on feature extraction algorithms to extract tweet content and context elements. However, manually extracting features is time-consuming and labour-intensive. To encode each tweet's sentence as a vector based on its contextual meaning, we proposed utilising Bidirectional Encoder Representations from Transformers (BERT) as a sentence embedding model. We then used these vectors to train some classifier models to detect rumours. Finally, we compared the performance of the BERT-based models to that of feature-engineering-based models. We discovered that the suggested BERT-based model improved all metrics by around 10% compared to the feature-engineering-based classification model.
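The classification stage can be sketched as follows. Synthetic vectors stand in for the BERT sentence embeddings (which in practice come from a pretrained encoder), and a small logistic regression trained by gradient descent plays the role of the rumour classifier; the paper compares several classifier models rather than prescribing this one.

```python
# Minimal sketch of the classification stage: synthetic vectors stand in
# for BERT sentence embeddings; logistic regression is an illustrative
# classifier choice, not necessarily the paper's best-performing model.
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=200):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        grad_w = X.T @ (p - y) / len(y)          # gradient of the log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)

rng = np.random.default_rng(0)
# Synthetic "sentence embeddings": rumour tweets around one center,
# non-rumour tweets around another (stand-ins for BERT outputs).
X = np.vstack([rng.normal(-1.0, 0.3, (20, 16)), rng.normal(1.0, 0.3, (20, 16))])
y = np.array([1] * 20 + [0] * 20)  # 1 = rumour, 0 = not a rumour
w, b = train_logreg(X, y)
print((predict(w, b, X) == y).mean())  # training accuracy
```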
Online social media services like Facebook witness an exponential increase in user activity when an event takes place in the real world. This activity is a combination of good-quality content, like information, personal views, opinions, and comments, as well as poor-quality content, like rumors, spam, and other malicious content. Although the good-quality content makes online social media a rich source of information, consumption of poor-quality content can degrade user experience and have an inappropriate impact in the real world. In addition, the enormous popularity, promptness, and reach of online social media services across the world make it essential to monitor this activity and minimize the production and spread of poor-quality content. Multiple studies in the past have analyzed the content spread on social networks during real-world events. However, little work has explored the Facebook social network. Two of the main reasons for the lack of studies on Facebook are the strict privacy settings and the limited amount of data available from Facebook, compared to Twitter. With over 1 billion monthly active users, Facebook is about five times bigger than its next biggest counterpart, Twitter, and is currently the largest online social network in the world. In this literature survey, we review the existing research work done on Facebook and study the techniques used to identify and analyze poor-quality content on Facebook and other social networks. We also attempt to understand the limitations posed by Facebook in terms of availability of data for collection and analysis, and try to understand whether existing techniques can be used to identify and study poor-quality content on Facebook.
Open domain Question Answering System - Research project in NLP, by GVS Chaitanya
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step towards such an ambitious goal is to deal with natural language, enabling the computer to understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. According to this discipline, Question Answering can be defined as the task that, given a question formulated in natural language, aims at finding one or more concise answers. Improvements in technology and the explosive demand for better information access have reignited interest in Q&A systems. The wealth of information on the web makes it an attractive resource for seeking quick answers to factual questions such as "Who was the first American to land in space?" or "What is the second tallest mountain in the world?"; yet today's most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. Q&A systems aim to develop techniques that go beyond retrieval of relevant documents in order to return exact answers to natural language factoid questions.
Hierarchical Transformers for User Semantic Similarity - ICWE 2023, by Marco Brambilla
We discuss the use of hierarchical transformers for user semantic similarity in the context of analyzing users' behavior and profiling social media users. The objectives of the research include finding the best model for computing semantic user similarity, exploring the use of transformer-based models, and evaluating whether the embeddings reflect the desired similarity concept and can be used for other tasks.
We use a large dataset of Twitter users and apply an automatic labeling approach. The dataset consists of English tweets posted in November and December 2020, totaling about 27GB of compressed data. Preprocessing steps include filtering out short texts, cleaning user connections, and selecting a benchmark set of users for evaluation.
Since Transformer architectures are known to work well on short text, we cannot use them on extensive collections of tweets describing the activity of a user. Therefore, we propose a hierarchical structure of transformer models to be used first on tweets and then on their aggregations.
The models used in the study include hierarchical transformers, and the tweet embeddings are obtained using four Transformer-based models: RoBERTa, BERTweet, Sentence-BERT, and Twitter4SSE. We test different techniques for processing tweet embeddings to generate accurate user embeddings, including mean pooling, recurrence over BERT (RoBERT), and transformer over BERT (ToBERT).
The evaluation of the models is done on a set of 5,000 users, comparing user similarities with 30 other candidate users, 5 of which are considered similar and 25 considered dissimilar. The evaluation metrics used include mean average precision (MAP), mean reciprocal rank (MRR) at 10, and normalized discounted cumulative gain (nDCG).
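For reference, the ranking metrics mentioned above can be computed from a ranked candidate list and the set of truly similar users. Below is a small self-contained sketch of MRR@10 and nDCG with binary relevance; the user IDs are invented.

```python
# Self-contained sketch of two of the evaluation metrics used above:
# mean reciprocal rank at 10 and normalized discounted cumulative gain,
# both with binary relevance judgments.
import math

def mrr_at_10(ranked, relevant):
    for i, item in enumerate(ranked[:10], start=1):
        if item in relevant:
            return 1.0 / i  # reciprocal rank of the first relevant hit
    return 0.0

def ndcg(ranked, relevant):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(ranked, start=1) if item in relevant)
    # Ideal DCG: all relevant items placed at the top of the ranking.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, len(relevant) + 1))
    return dcg / idcg

ranked = ["u7", "u3", "u9", "u1", "u5"]  # candidate users, best first
relevant = {"u3", "u1"}                  # users judged truly similar
print(mrr_at_10(ranked, relevant))  # first relevant at rank 2 -> 0.5
print(round(ndcg(ranked, relevant), 3))
```

MAP is computed analogously by averaging precision at each relevant rank over all query users.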
The optimization process involves selecting a loss function and using the AdamW optimizer with specific hyperparameters. The results show that the hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model performs the best among the alternatives.
In conclusion, the research provides a large unbiased dataset for user similarity analysis, presents a hierarchical language model optimized for accurate user similarity computation, and validates the models' performance on similarity tasks, with potential applications to related problems.
The future work includes investigating the impact of time and topic drift on the models' performance.
Exploring the Bi-verse: A trip across the digital and physical ecospheres, by Marco Brambilla
The Web and social media are the environments where people post their content, opinions, activities, and resources. Therefore, a considerable amount of user-generated content is produced every day for a wide variety of purposes. On the other side, people live their everyday life immersed in the physical world, where society, economy, politics, and personal relations continuously evolve. These two opposite and complementary environments are today fully integrated: they reflect each other and interact with each other in a stronger and stronger way.
Exploring and studying content and data coming from both environments offers a great opportunity to understand the ever-evolving modern society, in terms of topics of interest, events, relations, and behaviour.
In this speech I will discuss, through business cases and socio-political scenarios, how we can extract insights and understand reality by combining and analyzing data from the digital and physical worlds, so as to reach a better overall picture of reality itself. Along this path, we need to take into account that reality is complex and varies in time, space, and along many other dimensions, including societal and economic variables. The speech highlights the main challenges that need to be addressed and outlines some data science strategies that can be applied to tackle them.
This slide deck has been presented as a keynote speech at WISE 2022 in Biarritz, France.
Trigger.eu: Cocteau game for policy making - introduction and demo, by Marco Brambilla
COCTEAU stands for "Co-Creating the European Union".
It is a project supported by the European Union whose objective is to involve citizens in cooperating alongside policy makers, contributing to building a better future.
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ..., by Marco Brambilla
A large audience of users and typically a long time frame are needed to produce sensible and useful log data, making their collection an expensive task.
To address this limit, we propose a method that focuses on the generation of realistic navigational paths, i.e., web logs.
Our approach is extremely relevant because it can at the same time tackle the lack of publicly available data about web navigation logs, and also be adopted in industry for the automatic generation of realistic test settings for web sites yet to be deployed.
The generation has been implemented using deep learning methods, namely:
Recurrent Neural Networks, which are very well suited to temporally evolving data;
Generative Adversarial Networks, neural networks aimed at generating new data, such as images or text, very similar to the original ones and sometimes indistinguishable from them, which have become increasingly popular in recent years.
We ran experiments using open datasets of weblogs for training, and we ran tests to assess the performance of the methods. Results in generating new weblog data are quite good with respect to the two evaluation metrics adopted (BLEU and human evaluation).
Our study is described in detail in the paper published at ICWE 2020 (International Conference on Web Engineering), DOI: 10.1007/978-3-030-50578-3. It is available online on the Springer web site.
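The generators in this work are RNNs and GANs; as a much simpler illustrative baseline (not the method above), a first-order Markov chain learned from observed logs already yields paths in which every transition was seen in real data. The page names and toy logs below are invented.

```python
# Illustrative baseline for navigation-path generation (NOT the RNN/GAN
# method of the paper): a first-order Markov chain over observed web logs.
import random
from collections import defaultdict

def learn_transitions(paths):
    trans = defaultdict(list)
    for path in paths:
        for a, b in zip(path, path[1:]):
            trans[a].append(b)  # empirical next-page distribution
    return trans

def generate_path(trans, start, max_len=6, seed=42):
    rng = random.Random(seed)
    path = [start]
    # Walk the chain until a page with no outgoing transitions or max_len.
    while len(path) < max_len and trans[path[-1]]:
        path.append(rng.choice(trans[path[-1]]))
    return path

logs = [["home", "catalog", "product", "cart"],
        ["home", "search", "product", "cart"],
        ["home", "catalog", "product", "home"]]
trans = learn_transitions(logs)
print(generate_path(trans, "home"))
```

Unlike this baseline, the RNN can capture long-range dependencies across a session, which is exactly why the paper adopts it.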
Analyzing rich club behavior in open source projects, by Marco Brambilla
The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success.
In this work, we analyze open source projects to determine whether they exhibit a rich-club behavior, i.e., a phenomenon where contributors with a high number of collaborations (i.e., strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals. The presence or absence of a rich club has an impact on the sustainability and robustness of the project.
For this analysis, we build and study a dataset with the 100 most popular projects in GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure. We compute coefficients both for single source graphs and the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.
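The rich-club coefficient used in this kind of analysis can be computed directly from an edge list; here is a minimal, self-contained sketch (the toy graph is illustrative, not GitHub data):

```python
def rich_club_coefficient(edges, k):
    """phi(k): fraction of possible edges actually present among
    nodes whose degree is strictly greater than k."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    rich = {n for n, d in degree.items() if d > k}
    if len(rich) < 2:
        return 0.0
    present = sum(1 for u, v in edges if u in rich and v in rich)
    possible = len(rich) * (len(rich) - 1) // 2
    return present / possible

# Toy collaboration graph: three hubs (A, B, C) fully connected,
# each with one extra low-degree collaborator.
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("A", "d"), ("B", "e"), ("C", "f")]
phi = rich_club_coefficient(edges, k=1)  # hubs form a perfect club
```

Here phi(1) is 1.0 because the three hubs (degree > 1) are all connected to each other, i.e. an ideal rich club.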
Analysis of On-line Debate on Long-Running Political Phenomena.The Brexit C...Marco Brambilla
In this study, we demonstrate that computational social science is important for understanding people's behavior in political phenomena. Based on an analysis of the long-running Brexit debate on Twitter, we predict the public stance, identify discussion topics, and measure the involvement of automated accounts and politicians' social media accounts.
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info...Marco Brambilla
Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can impact safety, fuel economy and road congestion. Knowledge about a driver's style could be used to encourage "better" driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect the different driving behaviours, and thus to cluster drivers with similar behaviour.
This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance
policies and car rentals.
Driver behavioral characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect the different behaviours along each trip. Subsequently, drivers are clustered into similar profiles based on these behaviours, and the results are compared with a human-defined ground truth on driver classification. The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.
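One of the three approaches, DP-means, grows the number of clusters whenever a point is farther than a penalty lambda from every centroid; below is a minimal 1-D sketch (the data and lambda are illustrative, not from the study):

```python
def dp_means(points, lam, iters=10):
    """DP-means: like k-means, but a point farther than lam from all
    centroids spawns a new cluster instead of being force-assigned."""
    centroids = [sum(points) / len(points)]  # start with the global mean
    assign = []
    for _ in range(iters):
        assign = []
        for p in points:
            dists = [abs(p - c) for c in centroids]
            if min(dists) > lam:                  # too far: new cluster
                centroids.append(p)
                assign.append(len(centroids) - 1)
            else:
                assign.append(dists.index(min(dists)))
        # recompute each centroid from its currently assigned points
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assign

# Two well-separated groups of toy "trip segment" values
speeds = [0.0, 0.1, 0.2, 10.0, 10.1]
centroids, labels = dp_means(speeds, lam=5.0)
```

Unlike k-means, the number of behaviours does not need to be fixed in advance; lambda controls how different a segment must be to form a new behaviour cluster.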
Myths and challenges in knowledge extraction and analysis from human-generate...Marco Brambilla
For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted. The challenge is once again to capture and create knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice to the objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.
The talk explores the mixed realistic-utopian domain of knowledge extraction and reports on some tools and cases where the digital and physical worlds have been brought together for a better understanding of our society.
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...Marco Brambilla
Knowledge bases like DBpedia, Yago or Google's Knowledge Graph contain huge amounts of ontological knowledge harvested from (semi-)structured, curated data sources, such as relational databases or XML and HTML documents. Yet, the Web is full of knowledge that is not curated and/or structured and, hence, not easily indexed, for example social data. Most work so far in this context has been dedicated to the extraction of entities, i.e., people, things or concepts. This poster describes our work toward the extraction of relationships among entities. The objective is reconstructing a typed graph of entities and relationships to represent the knowledge contained in social data, without the need for a-priori domain knowledge. The experiments with real datasets show promising performance across a variety of domains.
The key distinguishing feature of the work is its focus on highly unstructured social data (tweets and Facebook posts) without reliable grammar structures. Traditional relation extraction approaches, whether supervised, semi-supervised or unsupervised, commonly assume the availability of grammatically correct language corpora.
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...Marco Brambilla
Internet of Things technologies and applications are evolving and continuously gaining traction in all fields and environments, including homes, cities, services, industry and commercial enterprises. However, many problems still need to be addressed. For instance, the IoT vision is mainly focused on the technological and infrastructure aspects, and on the management and analysis of the huge amount of generated data, while so far the development of front-ends and user interfaces for IoT has not played a relevant role in research. On the contrary, user interfaces in the IoT ecosystem can play a key role in the acceptance of solutions by final adopters. In this paper we present a model-driven approach to the design of IoT interfaces, by defining a specific visual design language and design patterns for IoT applications, and we show them at work. The language we propose is defined as an extension of the OMG standard language IFML.
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.Marco Brambilla
Consumer-centered software applications nowadays are required to be available both as mobile and desktop versions. However, the app design is frequently made only for one of the two (i.e., mobile first or web first) while missing an appropriate design for the other (which, in turn, simply mimics the interaction of the first one). This results in poor quality of the interaction on one or the other platform. Current solutions would require different designs, realized through different design methods and tools, which may double development and maintenance costs.
To mitigate this issue, this paper proposes a novel approach that supports the design of both web and mobile applications at once. Starting from a unique requirement and business specification, where web- and mobile-specific aspects are captured through tagging, we derive a platform-independent design of the system specified in IFML. This model is subsequently refined and detailed for the two platforms, and used to automatically generate both the web and mobile versions. If more precise interactions are needed for the mobile part, a blending with MobML, a mobile-specific modeling language, is devised. Full traceability of the relations between artifacts is granted.
The Web Science course focuses on the study of large-scale socio-technical systems associated with the World Wide Web. It considers the relationship between people and technology, the ways that society and technology complement one another and the way they impact on broader society. These analyses are inherently associated with Big Data management issues.
The course is organised in four parts.
1. Syntax
In the first part, the course introduces the basics of content analysis. It focuses on the syntactic aspects, covering the fundamentals of natural language processing and text mining. It describes the structure and typical characteristics of the different web sources, spanning search results, social media contents, social network structures, Web APIs, and so on. It also provides an overview of the basic Web analysis techniques applied in Web search and Web recommendation.
2. Semantics
In the second part, the course presents semantic technologies. These technologies are very important nowadays because they address the "variety" dimension of Big Data, i.e., they enable the integration of multiple and diverse sources of information, which is typical of the modern Web platform. Covered topics include:
- RDF - a flexible data model to represent heterogeneous data
- OWL - a flexible ontological language to model heterogeneous data sources
- SPARQL - a query language for RDF.
It shows how to put all the pieces together in order to achieve interoperability among heterogeneous information sources.
3. Time
The third part covers the realm of time-dependent data. The topics covered here address the "velocity" dimension of Big Data. It shows the importance, for many Big Data analysis scenarios, of processing data streams, coming for instance from Internet of Things (IoT) and Social Media sources; and it describes how to apply semantic and syntactic techniques in the context of time-dependent information. For instance, it shows how to extend RDF to model RDF streams, how to extend SPARQL to continuously process RDF streams, and how to reason on those RDF streams.
4. Applications
In the fourth part, the course focuses on specific application scenarios and presents the typical settings and problems where the presented techniques can be applied. This part discusses settings such as: big data analysis for smart cities; data analytics for brand monitoring (marketing) and event monitoring; data analysis for trend detection and user engagement; and so on.
On the Quest for Changing Knowledge. Capturing emerging entities from social ...Marco Brambilla
Massive data integration technologies have been recently used to produce very large ontologies. However, knowledge in the world continuously evolves, and ontologies are largely incomplete for what concerns low-frequency data, belonging to the so-called long tail.
Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and it immediately reflects the relevant changes which hide emerging entities. Thus, we propose a method for discovering emerging entities by extracting them from social content.
We start from a purely syntactic method as a baseline, and we propose a semantics-based method based on entity recognition and DBpedia. The method associates candidate entities to feature vectors, built from social content by using co-occurrence, and then extracts the emerging entities by using feature similarity measures.
Once instrumented by experts through very simple initialization, the method is capable of finding emerging entities and extracting their relevant relationships to given types; the method can be continuously or periodically iterated, using the already identified emerged knowledge as a new starting point.
We validate our method by applying it to a set of diverse domain-specific application scenarios, spanning fashion, literature, exhibitions and so on. We show the approach at work and we demonstrate its effectiveness on datasets with different characterization in terms of coverage, dynamics and size.
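The candidate-ranking step described above (feature vectors compared against the centroid of the seeds) can be sketched as follows; the vectors and entity names below are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(seeds, candidates, top=2):
    """Rank candidate entities by similarity of their feature vector
    to the centroid of the seed vectors; return the top names."""
    dim = len(next(iter(seeds.values())))
    centroid = [sum(v[i] for v in seeds.values()) / len(seeds)
                for i in range(dim)]
    scored = sorted(candidates.items(),
                    key=lambda kv: cosine(kv[1], centroid), reverse=True)
    return [name for name, _ in scored[:top]]

# Toy term co-occurrence vectors (illustrative, not real data)
seeds = {"seed_brand_1": [1.0, 0.9, 0.0],
         "seed_brand_2": [0.9, 1.0, 0.1]}
candidates = {"emerging_brand": [1.0, 0.8, 0.0],   # close to the centroid
              "off_topic_term": [0.0, 0.1, 1.0]}   # far from it
best = rank_candidates(seeds, candidates, top=1)
```

The top-ranked candidates can then feed the next iteration as new seeds, which is exactly how the method is iterated.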
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...Marco Brambilla
Cities are growing as melting pots of people with different cultures, religions, and languages. In this paper, through multilingual analysis of Twitter content shared within a city, we analyze the prevalent language in the different neighborhoods of the city and we compare the results with census data, in order to highlight any parallelisms or discrepancies between the two data sources. We show that the officially identified neighborhoods actually represent significantly different communities and that the use of social media as a data source helps to detect those weak signals that are not captured by traditional data.
Model driven software engineering in practice book - Chapter 9 - Model to tex...Marco Brambilla
Slides for the mdse-book.com chapter 9 - Model-to-text transformations.
Complete set of slides now available:
Chapter 1 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-1-introduction
Chapter 2 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-2-mdse-principles
Chapter 3 - http://www.slideshare.net/jcabot/model-driven-software-engineering-in-practice-chapter-3-mdse-use-cases
Chapter 4 - http://www.slideshare.net/jcabot/modeldriven-software-engineering-in-practice-chapter-4
Chapter 5 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-5-integration-of-modeldriven-in-development-processes
Chapter 6 - http://www.slideshare.net/jcabot/mdse-bookslideschapter6
Chapter 7 - http://www.slideshare.net/mbrambil/model-driven-software-engineering-in-practice-book-chapter-7-developing-your-own-modeling-language
Chapter 8 - http://www.slideshare.net/jcabot/modeldriven-software-engineering-in-practice-chapter-8-modeltomodel-transformations
Chapter 9 - https://www.slideshare.net/mbrambil/model-driven-software-engineering-in-practice-book-chapter-9-model-to-text-transformations-and-code-generation
Chapter 10 - http://www.slideshare.net/jcabot/mdse-bookslideschapter10managingmodels
This book discusses how approaches based on modeling can improve the daily practice of software professionals. This is known as Model-Driven Software Engineering (MDSE) or, simply, Model-Driven Engineering (MDE).
MDSE practices have proved to increase efficiency and effectiveness in software development. MDSE adoption in the software industry is foreseen to grow exponentially in the near future, e.g., due to the convergence of software development and business analysis.
This book is an agile and flexible tool to introduce you to the MDE and MDSE world, thus allowing you to quickly understand its basic principles and techniques and to choose the right set of MDE instruments for your needs so that you can start to benefit from MDE right away.
The book is organized into two main parts.
The first part discusses the foundations of MDSE in terms of basic concepts (i.e., models and transformations), driving principles, application scenarios and current standards, like the well-known MDA initiative proposed by the OMG (Object Management Group), as well as the practices on how to integrate MDE in existing development processes.
The second part deals with the technical aspects of MDSE, spanning from the basics on when and how to build a domain-specific modeling language, to the description of Model-to-Text and Model-to-Model transformations, and the tools that support the management of MDE projects.
The book covers a wide set of introductory and technical topics, spanning MDE at large, definitions and orientation in the MD* world, metamodeling, domain specific languages, model transformations, reverse engineering, OMG's MDA, UML, OCL, ATL, QVT, MOF, Eclipse, EMF, GMF, TCS, and xText.
Model driven software engineering in practice book - chapter 7 - Developing y...Marco Brambilla
Slides for the mdse-book.com - Chapter 7: Developing Your Own Modeling Language.
Automatic code generation for cross platform, multi-device mobile apps. An in...Marco Brambilla
This presentation was given at the MobileDeLi workshop 2015 collocated with the Splash 2015 conference.
With the continuously increasing adoption of mobile devices, software development companies have new business opportunities through direct sales in app stores and delivery of business to employee (B2E) and business to business (B2B) solutions. However, cross-platform and multi-device development is a barrier for today's IT solution providers, especially small and medium enterprises (SMEs), due to the high cost and technical complexity of targeting development to a wide spectrum of devices, which differ in format, interaction paradigm, and software architecture. So far, several authors have proposed the application of model driven approaches to mobile apps development following a variety of strategies. In this paper we present the results of a research study conducted to find the best strategy for WebRatio, a software development company interested in producing an MDD tool for designing and developing mobile apps, to enter the mobile apps market. We report on a comparative study conducted to identify the best trade-offs between various automatic code generation approaches.
“To be integrated is to feel secure, to feel connected.” The views and experi...AJHSSR Journal
ABSTRACT: Although a significant amount of literature exists on Morocco's migration policies and their successes and failures since their implementation in 2014, there is limited research on the integration of sub-Saharan African children into schools. This paper is part of a Ph.D. research project that aims to fill this gap. It reports the main findings of a study conducted with migrant children enrolled in two public schools in Rabat, Morocco, exploring how integration is defined by the children themselves and identifying the obstacles that they have encountered thus far. The paper uses an inductive approach and primarily focuses on the relationships of children with their teachers and peers as a key aspect of integration for students with a migration background. The study has led to several crucial findings. It emphasizes the significance of speaking Colloquial Moroccan Arabic (Darija) and being part of a community for effective integration. Moreover, it reveals that the use of Modern Standard Arabic as the language of instruction in schools is a source of frustration for students, indicating the need for language policy reform. The study underlines the importance of considering the children's agency when being integrated into mainstream public schools.
KEYWORDS: migration, education, integration, sub-Saharan African children, public school
Improving Workplace Safety Performance in Malaysian SMEs: The Role of Safety ...AJHSSR Journal
ABSTRACT: In the Malaysian context, small and medium enterprises (SMEs) experience a significant burden of workplace accidents. A consensus among scholars attributes a substantial portion of these incidents to human factors, particularly unsafe behaviors. This study, conducted in Malaysia's northern region, specifically targeted Safety and Health/Human Resource professionals within the manufacturing sector of SMEs. We gathered a robust dataset comprising 107 responses through a meticulously designed self-administered questionnaire. Employing advanced partial least squares-structural equation modeling (PLS-SEM) techniques with SmartPLS 3.2.9, we rigorously analyzed the data to scrutinize the intricate relationship between safety behavior and safety performance. The research findings unequivocally underscore the palpable and consequential impact of safety behavior variables, namely safety compliance and safety participation, on improving safety performance indicators such as accidents, injuries, and property damages. These results strongly validate the research hypotheses. Consequently, this study highlights the pivotal significance of cultivating safety behavior among employees, particularly in resource-constrained SME settings, as an essential step toward enhancing workplace safety performance.
KEYWORDS: Safety compliance, safety participation, safety performance, SME
4. Is it possible to extract a sub-selection of content items if and only if they are actually relevant to the topic or context of interest?
5. Examples of Related Studies
1. Earthquake Alarm System, Sakaki et al., Proc. of the 19th Int. Conference on WWW, 2010
2. Detection of influenza-like illnesses, Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
3. Discovering health topics, Paul & Dredze, PLoS ONE 9, e103408, 2014
4. Detection of prescription medication abuse, Sarker et al., Drug Safety, 2016
5. Tracking baseball and fashion topics, Lin et al., KDD, 2011
6. Event detection system, Kunneman & Bosch, BNAIC, 2014
7. Credibility of trend-topic hashtag usage, Castillo et al., Proc. of the 20th Int. Conf. on WWW, ACM, Hyderabad, India, 2011
8. Non-relevant tweet filtering, Hajjem & Latiri, Procedia Computer Science, 112, 2017
10. Example tweets:
Non-Relevant Tweet: "Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60 https://t.co/5DxkKn4o69"
Relevant Tweet: "Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later https://t.co/PyR2rP1Xpe #2017Rewind #archeology #history #Pompei #rome #RomanEmpire"
11. Four feature extraction strategies evaluated:
N-grams (unigrams, bigrams, trigrams)
Word2Vec
Word2Vec + additional tweet features
Dimensionality Reduction with PCA
12. Annotated Data
726 tweets, containing tweets with specific hashtags and keywords related to Pompei, Colosseo and Teatro alla Scala.
The data contains 50% relevant and 50% non-relevant tweets.
13. Model 1: Text transformation to n-grams
Number of [unigrams, bigrams, trigrams]: [494, 287, 228]
Vocabulary size: 1009 words
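The n-gram extraction underlying Model 1 can be sketched as follows (toy sentence; the real pipeline also applies tf-idf weighting on top of the counts):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Unigrams, bigrams and trigrams of a whitespace-tokenized tweet."""
    tokens = text.lower().split()
    feats = []
    for n in (1, 2, 3):
        feats.extend(ngrams(tokens, n))
    return feats

feats = extract_features("visit Pompei this summer")
```

Each tweet is then represented as a sparse vector over the resulting n-gram vocabulary (1009 entries in the actual dataset).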
14. Model 2: Text transformation to word2vec
• The word2vec dimension is set to 25.
• The word2vec vocabulary is built from 12K unlabeled tweets.
• Preprocessing operations before building the word2vec model:
• Convert to lowercase
• Discard web links and words shorter than 3 characters
• Eliminate stopwords before model building
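These preprocessing steps might look like the following sketch (the stopword list is a tiny illustrative stand-in for a real one):

```python
import re

STOPWORDS = {"per", "the", "a"}  # illustrative; a real list is much larger

def preprocess(tweet):
    """Lowercase, drop web links, then drop tokens shorter than
    3 characters and stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)   # discard web links
    tokens = re.findall(r"[a-z#@]\w*", text)   # words, hashtags, mentions
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

tokens = preprocess("Nuovi #corsi a Pompei https://t.co/QRrXlMC0g1 per info")
```

The cleaned token lists are what the word2vec model is trained on.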
17. Model 3: word2vec + Additional Features
Example (tweet and author features):
Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info https://t.co/QRrXlMC0g1
Number of Friends: 4
Number of Followers: 9
Number of Lists: 15
Number of Favourited Tweets: 0
Language: en
Number of Tweets: 4220
Source: PostPickr
Verified Account: False
Number of Favorited: 0
Geo Enabled: False
Number of Retweets: 0
Default Profile: False
21. Challenges ahead:
Collecting more data to build a larger Word2Vec vocabulary
New use cases
22. THANKS!
QUESTIONS?
Emre Calisir, Marco Brambilla
The Problem of Data Cleaning for Knowledge Extraction from Social Media
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi
Editor's Notes
500 million tweets each day.
100 million active daily people.
People tend to share every part of their lives, opinions...
Every sector wants to extract knowledge from social media: banking, health, tourism...
Data is created by humans.
Very open to errors.
Words can have multiple meanings.
Rule-based filtering systems fail (by rule-based systems we mean systems that collect tweets just using keyword queries). They bring related and unrelated content together.
Our research question.
All of them are based on Twitter data
Various use cases, various machine learning techniques
Supervised (1 to 7): SVM, Multinomial Naive Bayes, Logistic Regression, Decision Trees, Bayes Networks.
Unsupervised (the last one): Latent Dirichlet Allocation.
We benefit from these studies.
Early Alert System. Technique: SVM classifier. Features: keywords and the context of target-event words.
Detection of illnesses. Technique: Logistic Regression with bag-of-words features.
Filtering out non-relevant content. Technique: Logistic Regression with unigrams, bigrams and trigrams. Very similar to the first model in our study.
Discovery of patterns of abuse of specific medications.
Topic tracking on tweet streams. Again unigrams
Discard noisy data for an event detection system based on Twitter stream data.
Assessment of the credibility of trending topics, to check whether they are really related to the topic of the used hashtag, or spam. They tried SVM, decision trees and Bayes classifiers.
Unsupervised technique. Pooling method using Information Retrieval and Latent Dirichlet Allocation.
Annotated dataset, labeled as Relevant or Non-Relevant with respect to the topic.
We train SVM based model with labeled data.
We predict the new unlabeled data.
We selected Support Vector Machines with a linear kernel: the recommended approach when there are sparse feature vectors.
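A linear SVM of this kind can be trained with a simple Pegasos-style sub-gradient update; the sketch below is a toy, stdlib-only stand-in for the actual library-based setup, run on 1-D linearly separable data:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Pegasos-style sub-gradient training of a linear SVM.
    X: list of feature lists; y: labels in {-1, +1}."""
    random.seed(0)
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in random.sample(range(len(X)), len(X)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:   # hinge-loss violation: move toward the point
                w = [(1 - eta * lam) * wj + eta * y[i] * xj
                     for wj, xj in zip(w, X[i])]
                b += eta * y[i]
            else:            # no violation: only apply regularization
                w = [(1 - eta * lam) * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[0.0], [1.0], [9.0], [10.0]]   # toy 1-D separable data
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
```

In practice one would of course use an off-the-shelf linear SVM implementation; the sketch only shows why the linear kernel copes well with high-dimensional sparse vectors (the update touches only non-zero features).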
Our basic approach to obtain a relevant dataset.
We applied with Twitter but is also applicable to any kind of social media data.
And a more detailed illustration of the general flow.
Initially we build our model.
And then we make the prediction.
Finally we have a clean dataset.
Pompei, Colosseo and Teatro Alla Scala
Imagine that we have a social media monitoring tool.
We want to track tweets related to the historical value of Pompei.
How to do it?
Label relevant and non-relevant data.
Who will label it? Subject experts
Subject experts labeled the tweets as relevant and non-relevant.
A random guess could predict with 0.5 accuracy.
N-grams are a widely used technique in text classification.
This is an example of how we transform a tweet into a feature vector.
We also applied tf-idf to increase accuracy; it is a best practice in n-gram usage.
The default word2vec dimension is 100. However, our dataset is limited; we performed an analysis and achieved better results by representing words with 25 features.
The larger the word2vec vocabulary, the better it captures semantic relations.
Top 3 similarities shown for specific words.
Inside parentheses is the vectorial distance between words.
For a given tweet, we calculate the average value of its word vectors.
Another illustration of the trained word2vec model.
This graph is generated with a few words, to show just the vectorial representations of words.
Feature Extraction Strategy
Vector features: Word2Vec Numerical features: MinMaxScaler Categorical features: OneHotEncoder
Dimension of features after transformation:
dim(Word2Vec) = 25 -- dim(numerical features) = 7 -- dim(one-hot-encoding) = 68 -- Total number of features = 125
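The feature combination described in these notes (word2vec vector + min-max-scaled numerical features + one-hot-encoded categorical features) can be sketched with toy dimensions:

```python
def min_max_scale(values, lo, hi):
    """Scale numerical features to [0, 1] given known min/max bounds."""
    return [(v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(values, lo, hi)]

def one_hot(value, categories):
    """One-hot encode a categorical feature over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

# Toy dimensions (the actual model used a 25-dim word2vec vector
# plus 7 numerical and several one-hot-encoded categorical features)
w2v = [0.1, -0.3, 0.7]                                       # tweet vector
numeric = min_max_scale([4, 9], lo=[0, 0], hi=[100, 1000])   # friends, followers
categorical = one_hot("en", ["en", "it", "other"])           # tweet language

feature_vector = w2v + numeric + categorical  # concatenated final vector
```

Concatenating the three blocks yields the single fixed-length vector that the classifier consumes.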
The target is to analyze the impact of dimensionality reduction; it could improve model accuracy.
The graphic shows that for a desired dimension below 40 the model is underfitting, while above 40 it is overfitting.
We therefore selected a dimension size of 40.
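The PCA reduction can be sketched with NumPy (assumed available; toy data, and 2 components instead of the 40 selected in the study):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                 # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T        # coordinates in PC space

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                # 6 samples, 4 features
Z = pca_reduce(X, 2)                       # reduced to 2 dimensions
```

The retained components are ordered by explained variance, so sweeping n_components and measuring accuracy is how the 40-dimension sweet spot was found.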
We preferred cross-validation because the size of our labeled dataset is limited (726 tweets).
All the classification models are successful.
The dotted line shows a random guess (50% relevant and 50% non-relevant tweets exist in the data).
Model 1 has the best performance; Model 4 has the second best.
With a larger word2vec vocabulary, word2vec could achieve better accuracy than n-grams.
The ROC curve and AUC scores confirm the performance of our model.
We addressed our research question and showed that text classification is a very convenient way to obtain topic-relevant data.
We still have a way to go to build a more accurate system.
We are now creating a larger dataset, and we will identify new use cases, as well as open datasets to compare our results with the literature.
We will also try other algorithms, not only SVM.
Finally, we could bring in external data by importing content from the web links included in tweets.