Emre Calisir, Marco Brambilla
KDWEB2018, Cáceres, Spain
The Problem of Data Cleaning
for Knowledge Extraction
from Social Media
Knowledge Extraction
from Social Media
is a Need
Keyword- or hashtag-based filtering is insufficient
Is it possible to extract a sub-selection of content items if and only if they are actually relevant to the topic or context of interest?
Examples of Related Studies
1. Earthquake alarm system, Sakaki et al., Proc. of the 19th Int. Conf. on WWW, 2010
2. Detection of influenza-like illnesses, Culotta, Proc. of the 1st Workshop on Social Media Analytics, Washington, D.C., 2010
3. Discovering health topics, Paul & Dredze, PLoS ONE 9, e103408, 2014
4. Detection of prescription medication abuse, Sarker et al., Drug Safety, 2016
5. Tracking baseball and fashion topics, Lin et al., KDD, 2011
6. Event detection system, Kunneman & van den Bosch, BNAIC, 2014
7. Credibility of trend-topic hashtag usage, Castillo et al., Proc. of the 20th Int. Conf. on WWW, ACM, Hyderabad, India, 2011
8. Non-relevant tweet filtering, Hajjem & Latiri, Procedia Computer Science, 112, 2017
Supervised learning
trained on annotated
data could help us
Overview
[Diagram: a Topic Relevancy Detection Machine, trained on annotated data, turns raw social media content into a topic-relevant dataset]
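The diagram boils down to a train-then-filter flow. A minimal sketch, assuming scikit-learn; `features()`, `annotated_tweets`, `labels`, and `new_tweets` are illustrative placeholders, while the linear-kernel SVM is the classifier named in the speaker notes (#7):

```python
from sklearn.svm import SVC

# Train the relevancy detector on the annotated tweets...
clf = SVC(kernel="linear")                    # linear kernel, per the notes
clf.fit(features(annotated_tweets), labels)   # labels: 1 = relevant, 0 = not

# ...then keep only the tweets predicted to be topic-relevant.
predictions = clf.predict(features(new_tweets))
topic_relevant_dataset = [t for t, p in zip(new_tweets, predictions) if p == 1]
```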
Proposed Data Cleaning Method for
Knowledge Extraction
Use Case
Cultural Institutions of Italy
Non-Relevant Tweet:
Best #Hotel Deals in #Pompei #HotelDegliAmiciPompei starting at EUR99.60 https://t.co/5DxkKn4o69

Relevant Tweet:
Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later https://t.co/PyR2rP1Xpe #2017Rewind #archeology #history #Pompei #rome #RomanEmpire
Four feature extraction strategies evaluated:
N-grams (unigrams, bigrams, trigrams)
Word2Vec
Word2Vec + additional tweet features
Dimensionality Reduction with PCA
Annotated Data
726 tweets, containing tweets with specific hashtags and keywords related to Pompei, Colosseo, and Teatro alla Scala.
The data is balanced: 50% relevant and 50% non-relevant.
Model 1: Text transformation to n-grams
Number of [unigrams, bigrams, trigrams]: [494, 287, 228]
Vocabulary size: 1009 words
[Figure: example of a tweet transformed into an n-gram feature vector]
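As a sketch of this transformation, assuming scikit-learn (the talk does not name the library), the counts above correspond to extracting unigrams through trigrams with TF-IDF weighting, which the speaker notes call a best practice; the two sample tweets here are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "best #hotel deals in #pompei starting at eur99.60",
    "pompei hero pliny the elder may have been found 2000 years later",
]

# Unigrams, bigrams, and trigrams, TF-IDF weighted; on the full dataset
# this yields 494 + 287 + 228 = 1009 n-gram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)
print(X.shape)  # (number of tweets, number of n-gram features)
```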
Model 2: Text transformation to word2vec
• Word2vec dimension selected as 25.
• Word2vec vocabulary built from 12K unlabeled tweets.
• Preprocessing operations before building the word2vec model:
  • convert to lowercase
  • discard web links
  • discard words shorter than 3 characters
  • eliminate stopwords
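A sketch of this preprocessing and training step, assuming gensim >= 4 (older releases call the dimension parameter `size` instead of `vector_size`); the stopword list and sample tweets below are illustrative stand-ins for the 12K unlabeled tweets:

```python
import re
from gensim.models import Word2Vec

STOPWORDS = {"the", "in", "at", "per"}  # placeholder stopword list

def preprocess(tweet):
    tweet = tweet.lower()                           # convert to lowercase
    tweet = re.sub(r"https?://\S+", "", tweet)      # discard web links
    return [w for w in tweet.split()
            if len(w) >= 3 and w not in STOPWORDS]  # drop short words, stopwords

unlabeled_tweets = [
    "Best #Hotel Deals in #Pompei starting at EUR99.60 https://t.co/5DxkKn4o69",
    "Pompei Hero Pliny the Elder May Have Been Found 2000 Years Later",
]
corpus = [preprocess(t) for t in unlabeled_tweets]
model = Word2Vec(corpus, vector_size=25, min_count=1)  # 25-dim embeddings
```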
Model 2: Text transformation to word2vec
[Figure: top-3 most similar words for selected terms, with vector distances in parentheses]
[Figure: 2D plot of the vector representations of a few selected words]
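Per the speaker notes, each tweet is then embedded as the average of its word vectors, and the similarity figure reports top-3 neighbors. A sketch reusing `preprocess` and `model` from above, with numpy assumed:

```python
import numpy as np

def tweet_vector(tokens, model):
    # Average the vectors of the tweet's in-vocabulary words.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

vec = tweet_vector(preprocess("Pliny the Elder found in #Pompei"), model)

# Top-3 most similar words for a term, as in the similarity figure.
print(model.wv.most_similar("#pompei", topn=3))
```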
Model 3: word2vec + Additional Features
Example:
Full text: #nuovi #corsi #inglese #settembre #pompei #chiamaci per #info https://t.co/QRrXlMC0g1
(Italian: new English courses in September, Pompei, call us for info)
Tweet features:
• Language: en
• Source: PostPickr
• Number of Favorited: 0
• Number of Retweets: 0
Author features:
• Number of Friends: 4
• Number of Followers: 9
• Number of Lists: 15
• Number of Favourited Tweets: 0
• Number of Tweets: 4220
• Verified Account: False
• Geo Enabled: False
• Default Profile: False
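The speaker notes name MinMaxScaler for the numerical features and OneHotEncoder for the categorical ones, concatenated with the word2vec tweet vector. A sketch of that assembly using the example above (the exact split of columns into numerical and categorical is an assumption), reusing `preprocess`, `model`, and `tweet_vector` from the Model 2 sketches:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numerical = np.array([[4, 9, 15, 0, 4220, 0, 0]])      # friends, followers, lists, ...
categorical = np.array([["en", "PostPickr", "False"]]) # language, source, verified

num_scaled = MinMaxScaler().fit_transform(numerical)
cat_onehot = OneHotEncoder().fit_transform(categorical).toarray()

text = "#nuovi #corsi #inglese #settembre #pompei #chiamaci per #info"
text_vec = tweet_vector(preprocess(text), model)       # 25-dim word2vec average

features = np.hstack([text_vec.reshape(1, -1), num_scaled, cat_onehot])
```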
Model 4: PCA applied on Model 3
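Per the notes on the dimensionality analysis, 40 principal components gave the best trade-off (below 40 the model underfits, above 40 it overfits). A minimal sketch, assuming scikit-learn, with `X_model3` as a hypothetical name for the stacked Model 3 feature matrix:

```python
from sklearn.decomposition import PCA

# Reduce the Model 3 features to 40 dimensions, per the analysis above.
pca = PCA(n_components=40)
X_reduced = pca.fit_transform(X_model3)  # X_model3: one Model 3 row per tweet
```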
10-fold Cross-Validated Results

Model       1      2      3      4
Accuracy    0.84   0.81   0.82   0.83
Precision   0.84   0.78   0.83   0.84
Recall      0.83   0.86   0.80   0.81
F1          0.83   0.82   0.81   0.82

Model 1: n-grams
Model 2: word2vec
Model 3: word2vec + additional features
Model 4: PCA applied on Model 3
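The speaker notes explain that 10-fold cross-validation was preferred because the labeled dataset is small (726 tweets), with a linear-kernel SVM as the classifier. A minimal sketch of that evaluation, assuming scikit-learn; `X` stands for any of the four feature matrices and `y` for the relevant/non-relevant labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 10-fold CV, as used for the table above; F1 is one of the reported metrics.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10, scoring="f1")
print(scores.mean())
```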
Conclusions
Supervised machine learning techniques could help to obtain topic-relevant social media data
Challenges ahead:
• Collecting more data to build a larger Word2Vec vocabulary
• New use cases
THANKS!
QUESTIONS?
Emre Calisir, Marco Brambilla
The Problem of Data Cleaning for Knowledge Extraction from Social Media
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi


Editor's Notes

  • #3 500 million tweets are posted each day by 100 million daily active users. People tend to share every part of their lives and opinions. Every sector wants to extract knowledge from social media: banking, health, tourism...
  • #4 Data is created by humans, so it is very error-prone, and words can have multiple meanings. Rule-based filtering systems fail (by rule-based systems we mean systems that collect tweets using only queries): they bring related and unrelated content together.
  • #5 Our research question.
  • #6 All of these studies are based on Twitter data, covering various use cases and various machine learning techniques: supervised for studies 1–7 (SVM, Multinomial Bayes, Logistic Regression, Decision Trees, Bayesian Networks) and unsupervised for the last one (Latent Dirichlet Allocation). We benefit from these studies. Early alert system: SVM classifier, with keywords and the context of target-event words as features. Detection of illnesses: logistic regression over bag-of-words. Filtering out non-relevant content: logistic regression over unigrams, bigrams, and trigrams, very similar to the first model in our study. Discovery of patterns of abuse of specific medications. Topic tracking on tweet streams, again with unigrams. Discarding noisy data for an event detection system based on Twitter stream data. Assessment of the credibility of trending topics, to check whether they are really related to the topic of the hashtag used, or spam; they tried SVM, decision trees, and Bayes. Unsupervised technique: a pooling method using information retrieval and Latent Dirichlet Allocation.
  • #7 An annotated dataset, labeled as relevant or non-relevant to the topic. We train an SVM-based model on the labeled data and predict labels for new, unlabeled data. We selected Support Vector Machines with a linear kernel, the recommended approach when the features are sparse vectors.
  • #8 Our basic approach to obtaining a relevant dataset. We applied it to Twitter, but it is also applicable to any kind of social media data.
  • #9 A more detailed illustration of the general flow: first we build our model, then we make the prediction, and finally we have a clean dataset.
  • #10 Pompei, Colosseo, and Teatro alla Scala.
  • #11 Imagine that we have a social media monitoring tool and want to track tweets related to the historical value of Pompei. How do we do it? Label relevant and non-relevant data. Who will label it? Subject experts.
  • #13 Subject experts labeled the tweets as relevant or non-relevant. A random guess would predict with 0.5 accuracy.
  • #14 N-grams are a widely used technique in text classification. This slide shows an example of how we transform a tweet into a feature vector. We also applied TF-IDF to increase accuracy, a best practice in n-gram usage.
  • #15 The default word2vec dimension is 100; however, our dataset is limited. We performed an analysis and achieved better results by representing words with 25 features. The larger the word2vec vocabulary, the better the semantic relations it creates.
  • #16 Top-3 similarities shown for specific words; the vectorial distance between words is given in parentheses. For a given tweet, we calculate the average value of its word vectors.
  • #17 Another illustration of the trained word2vec model. This graph is generated with a few words, just to show the vectorial representations of words.
  • #18 Feature extraction strategy: vector features with Word2Vec, numerical features with MinMaxScaler, categorical features with OneHotEncoder. Dimension of features after transformation: dim(Word2Vec) = 25, dim(numerical features) = 7, dim(one-hot encoding) = 68; total number of features = 125.
  • #19 The target is to analyze the impact of dimensionality reduction, which could improve model accuracy. Graphic: with a target dimension below 40 the model underfits; above 40 it overfits. We selected a dimension of 40.
  • #20 We preferred cross-validation because the size of our labeled dataset is limited (726 tweets). All the classification models are successful. The dotted line shows a random guess (the data contains 50% relevant and 50% non-relevant tweets). Model 1 has the best performance and Model 4 the second best. With a larger word2vec vocabulary, word2vec could achieve better accuracy than n-grams. The ROC curves and AUC scores confirm the performance of our models.
  • #21 We addressed our research question and showed that text classification is a very convenient way to obtain topic-relevant data.
  • #22 We still have a way to go toward building a more accurate system. We are now creating a larger dataset, and we will find new use cases as well as open datasets to compare our results with the literature. We will also try other algorithms, not only SVM, and we can bring in external data by importing content from the web links in tweets.