This document proposes a method for analyzing communities on social networks using graph representation learning. It involves collecting data on brands and followers from Instagram, constructing graphs to model interactions, extracting embeddings with node2vec, classifying users, and clustering communities. Experiments on data from an Italian fashion brand showed that embeddings extracted from reduced graphs performed well in classification. Clustering identified sub-communities, validated by domain experts, related to professionals, holidays, and regular users. The method effectively analyzed social network communities through network modeling and representation learning.
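As a rough illustration of the embedding step (the graph and all names here are hypothetical, and this uniform-walk sketch omits node2vec's p/q biasing), random walks over the interaction graph produce the "sentences" on which an embedding model such as word2vec is then trained:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=42):
    """Generate uniform random walks over an adjacency dict.

    A DeepWalk-style simplification: node2vec additionally biases
    the walk with its return/in-out parameters p and q.
    """
    rng = random.Random(seed)
    walks = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy interaction graph: a brand "B" and three followers.
graph = {"B": ["u1", "u2", "u3"], "u1": ["B", "u2"],
         "u2": ["B", "u1"], "u3": ["B"]}
corpus = random_walks(graph)
```

The resulting walk corpus would then be fed to a skip-gram model to obtain one vector per node.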
Iterative knowledge extraction from social networks. The Web Conference 2018 - Marco Brambilla
Knowledge in the world continuously evolves, and ontologies are largely incomplete, especially regarding data belonging to the so-called long tail. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds.
In this paper we address the following research questions: (1) How does the reconstructed domain knowledge evolve if the candidates of one extraction are recursively used as seeds? (2) How does the reconstructed domain knowledge spread geographically? (3) Can the method be used to inspect the past, present, and future of knowledge? (4) Can the method be used to find emerging knowledge?
This work was presented at The Web Conference 2018, MSM workshop.
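A minimal sketch of the ranking step described above, with made-up seed and candidate feature vectors (the actual method builds these vectors from terms occurring in the entities' social content):

```python
import math

def rank_candidates(seeds, candidates):
    """Rank candidate entities by the distance of their feature
    vector from the centroid of the seed vectors, closest first."""
    dim = len(next(iter(seeds.values())))
    centroid = [sum(v[i] for v in seeds.values()) / len(seeds)
                for i in range(dim)]

    def dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, centroid)))

    return sorted(candidates, key=lambda name: dist(candidates[name]))

# Hypothetical expert-provided seed entities and extracted candidates.
seeds = {"brandA": [1.0, 0.0], "brandB": [0.8, 0.2]}
cands = {"new-entity": [0.9, 0.1], "off-topic": [0.0, 1.0]}
ranking = rank_candidates(seeds, cands)
```

Feeding the top-ranked candidates back in as new seeds gives the iterative behavior the abstract describes.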
In online social media platforms, users can express their ideas by posting original content or by adding comments and responses to existing posts, thus generating virtual discussions and conversations. Studying these conversations is essential for understanding the online communication behavior of users. This study proposes a novel approach to retrieving popular patterns in online conversations using network-based analysis. The analysis consists of two main stages: intent analysis and network generation. Users' intentions are detected using keyword-based categorization of posts and comments, integrated with classification through Naïve Bayes and Support Vector Machine algorithms for uncategorized comments. A continuous human-in-the-loop approach further improves the keyword-based classification. To build and understand communication patterns among the users, we build conversation graphs starting from the hierarchical structure of posts and comments, using a directed multigraph network. The experiments categorize 90% of the comments with 98% accuracy on a real social media dataset. The model then identifies relevant patterns in terms of shape and content, and finally determines the relevance and frequency of the patterns. Results show that the most popular online discussion patterns obtained from conversation graphs resemble real-life interactions and communication.
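One plausible way to derive directed-multigraph edges from the post/comment hierarchy is sketched below (the thread data and field layout are hypothetical, not the paper's actual schema): each reply yields an edge from the replier to the author of the parent item, and parallel edges are kept.

```python
def conversation_edges(items):
    """Build directed multigraph edges (replier -> parent author)
    from a flat list of posts/comments with parent references.

    Each item is (item_id, author, parent_id or None for a root post).
    Parallel edges are preserved, as in a multigraph.
    """
    authors = {item_id: author for item_id, author, _ in items}
    edges = []
    for item_id, author, parent_id in items:
        if parent_id is not None:
            edges.append((author, authors[parent_id]))
    return edges

thread = [
    ("p1", "alice", None),   # original post
    ("c1", "bob", "p1"),     # comment on the post
    ("c2", "carol", "c1"),   # reply to bob's comment
    ("c3", "bob", "p1"),     # bob comments again: parallel edge
]
edges = conversation_edges(thread)
```

The shapes of the resulting small graphs are what the pattern-mining stage would then count and rank.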
Data Cleaning for social media knowledge extraction - Marco Brambilla
Social media platforms let users share their opinions through textual or multimedia content. In many settings, this becomes a valuable source of knowledge that can be exploited for specific business objectives. Brands and companies often ask to monitor social media as a source for understanding the stance, opinion, and sentiment of their customers, audience, and potential audience. This is crucial for them because it lets them understand trends and future commercial and marketing opportunities.
However, all this relies on a solid and reliable data collection phase, which guarantees that all analyses, extractions, and predictions are applied to clean, solid, and focused data. Indeed, the typical topic-based collection of social media content, performed through keyword-based search, tends to yield very noisy results.
We recently implemented a simple study aiming at cleaning the data collected from social content, within specific domains or related to given topics of interest. We propose a basic method for data cleaning and removal of off-topic content based on supervised machine learning techniques, i.e. classification, over data collected from social media platforms based on keywords regarding a specific topic. We define a general method for this and then we validate it through an experiment of data extraction from Twitter, with respect to a set of famous cultural institutions in Italy, including theaters, museums, and other venues.
For this case, we collaborated with domain experts to label the dataset, and then we evaluated and compared the performance of classifiers that are trained with different feature extraction strategies.
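As a minimal, self-contained sketch of the supervised-classification idea (the training tweets, labels, and keywords below are all invented for illustration, not the study's data), a multinomial Naive Bayes filter with Laplace smoothing can separate on-topic from off-topic content:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train a multinomial Naive Bayes text classifier.
    `docs` is a list of (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of documents
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing avoids zero probabilities for unseen words.
            lp += math.log((word_counts[label][w] + 1) /
                           (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical labeled tweets collected via a theater-related keyword.
train = [
    ("opera premiere at the teatro tonight", "on-topic"),
    ("museum exhibition opening this week", "on-topic"),
    ("selling cheap tickets follow this link", "off-topic"),
    ("win a free phone click here", "off-topic"),
]
model = train_nb(train)
label = classify(model, "exhibition at the museum tonight")
```

In practice one would use richer feature extraction strategies, as the study compares, but the pipeline shape is the same.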
Social network analysis for modeling & tuning social media website - Edward B. Rockower
Social Network Analysis of a Professional Online Social Media Collaboration Community. Tuning and optimizing based on observed social network dynamics and user behavior.
An introduction to the world of Social Network Analysis and a view of how it may help learning networks. History, data collection, and several analysis techniques are shown.
LAK13 Tutorial Social Network Analysis 4 Learning Analytics - goehnert
Slides of the tutorial "Computational Methods and Tools for Social Network Analysis Networked Learning Communities" at the LAK 2013 in Leuven.
Tutorial Homepage:
http://snatutoriallak2013.ku.de/index.php/SNA_tutorial_at_LAK_2013
Conference Homepage:
http://lakconference2013.wordpress.com/
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear... - Xiaohan Zeng
The advent of social networks has completely changed our daily lives. The deluge of data collected on Social Network Services (SNS) and recent developments in complex network theory have enabled many marvelous predictive analyses, which tell us many amazing stories.
Why do we often feel that "the world is so small?" Is the six-degree separation purely imagination or based on mathematical insights? Why are there just a few rockstars who enjoy extreme popularity while most of us stay unknown to the world? When science meets coffee shop knowledge, things are bound to be intriguing.
I will first briefly describe what social networks are, in the mathematical sense. Then I will introduce some ways to extract characteristics of networks, and how these analyses can explain many anecdotes in our life. Finally, I'll show an example of what we can learn from social network analysis, based on data from Groupon.
1. Basics of Social Networks
2. Real-world problem
3. How to construct a graph from a real-world problem?
4. Which graph-theory problem do we get from the real-world problem?
5. Graph types of social networks
6. Special properties of social graphs
7. How to find communities and groups in social networks? (Algorithms)
8. How to interpret the graph solution back into the real-world problem?
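As a taste of item 7 above, here is a minimal pure-Python computation of Newman modularity, the quality score that many community-detection algorithms try to maximize (the toy graph is made up for illustration):

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition: the fraction of edges
    inside communities minus the fraction expected under a random
    rewiring that preserves node degrees."""
    m = len(edges)
    member = {n: i for i, comm in enumerate(communities) for n in comm}
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for i, comm in enumerate(communities):
        intra = sum(1 for u, v in edges
                    if member[u] == i and member[v] == i)
        deg = sum(degree[n] for n in comm)
        q += intra / m - (deg / (2 * m)) ** 2
    return q

# Two disjoint triangles, each taken as one community.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
q = modularity(edges, [{1, 2, 3}, {4, 5, 6}])
```

A higher Q means the partition captures denser-than-expected groups; algorithms such as Louvain search the space of partitions for high-Q ones.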
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU... - Denis Parra Santander
- First version was a guest lecture about Network Visualization in the class "Data Visualization" taught by Dr. Sharon Hsiao in the QMSS program at Columbia University http://www.columbia.edu/~ih2240/dataviz/index.htm
- This updated version was delivered in our class on SNA at PUC Chile in the MPGI master program.
How to conduct a social network analysis: A tool for empowering teams and wor... - Jeromy Anglim
Slides and details available at: http://jeromyanglim.blogspot.com/2009/10/how-to-conduct-social-network-analysis.html
A talk on using social network analysis as a team development tool.
Big Data Analytics - USE CASES SOLVED USING NETWORK ANALYSIS TECHNIQUES IN GEPHI - Ruchika Sharma
This report was completed as part of our Big Data Analysis course at Jindal Global Business School.
In this report, we mainly focus on a literature review of 10 use cases in the visualization task. The use cases pertain to varied uses of the social media site Twitter in political, cultural, and business contexts, including use by drug marketers and musicians, among others.
Predicting Social Interactions from Different Sources of Location-based Knowl... - Michael Steurer
Recent research has shown that digital online geo-location traces are new and valuable sources for predicting social interactions between users, e.g., check-ins via FourSquare or geo-location information in Flickr images. Interestingly, no related work has studied the extent to which social interactions between users can be predicted by taking more than one location-based knowledge source into account. To contribute to this field of research, we have collected social interaction data of users in an online social network called My Second Life and three related location-based knowledge sources for these users (monitored locations, shared locations, and favored locations), to show the extent to which social interactions between users can be predicted. Using supervised and unsupervised machine learning techniques, we find, on the one hand, that the same location-based features (e.g., common regions and common observations) perform well across the three different sources. On the other hand, we find that shared location information is better suited to predicting social interactions between users than monitored or favored location information.
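Pairwise features of the "common regions" kind mentioned above can be sketched very simply (the feature names and location sets below are hypothetical, not the paper's actual feature definitions): count the overlap of two users' location sets and normalize it, e.g. with Jaccard similarity, before feeding it to a classifier.

```python
def location_features(locs_a, locs_b):
    """Simple pairwise features from two users' location sets:
    number of common regions and their Jaccard overlap."""
    common = locs_a & locs_b
    union = locs_a | locs_b
    return {"common": len(common),
            "jaccard": len(common) / len(union) if union else 0.0}

# Hypothetical region sets observed for two users.
a = {"plaza", "cafe", "harbor"}
b = {"cafe", "harbor", "market"}
feats = location_features(a, b)
```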
2013 NodeXL Social Media Network Analysis - Marc Smith
Social media network analysis and visualization with NodeXL - the network overview, discovery, and exploration add-in for Excel. Map Twitter, Facebook, email, blogs, and the web with a point-and-click interface within the familiar spreadsheet.
Community Structure-based Audience Expansion for Digital Advertising - Eunjae Kim
- Worked in Analytical Data Warehouse team at Yahoo! Inc.
- Designed and implemented Community Structure-based Audience Expansion for Digital Advertising
- The project includes processing data in a distributed environment, training a click prediction model, and expanding target audience groups with community detection techniques, using various big data frameworks and software
User Behavior Hashing for Audience Expansion - Databricks
Learning to hash has been widely adopted as a solution to approximate nearest neighbor search for large-scale data retrieval in many applications. Applying deep architectures to learning to hash has recently gained increasing attention due to its computational efficiency and retrieval quality.
Gephi is an open source software for graph and network analysis. It uses a 3D render engine to display large networks in real time and to speed up exploration. A flexible and multi-task architecture brings new possibilities to work with complex data sets and produce valuable visual results. We present several key features of Gephi in the context of interactive exploration and interpretation of networks. It provides easy and broad access to network data and allows for spatializing, filtering, navigating, manipulating, and clustering.
Provenance Analytics at AAAI Human Computation Conference 2013 - T Dong Huynh
Trung Dong Huynh presenting the paper entitled "Interpretation of Crowdsourced Activities using Provenance Network Analysis" - how analysing provenance graphs can help interpret crowdsourced activities in CollabMap
Analyzing rich club behavior in open source projects - Marco Brambilla
The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success.
In this work, we analyze open source projects to determine whether they exhibit a rich-club behavior, i.e., a phenomenon where contributors with a high number of collaborations (i.e., strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals. The presence or absence of a rich club has an impact on the sustainability and robustness of the project.
For this analysis, we build and study a dataset with the 100 most popular projects on GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues, and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure. We compute coefficients both for single-source graphs and for the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.
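The standard rich-club coefficient can be sketched in a few lines of pure Python (the collaboration graph below is invented; real analyses typically also normalize against degree-preserving random graphs, which this sketch omits):

```python
def rich_club(edges, k):
    """Rich-club coefficient phi(k): density of the subgraph induced
    by nodes with degree greater than k. Values near 1 mean the
    best-connected contributors also collaborate tightly with each
    other."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    rich = {n for n, d in degree.items() if d > k}
    n = len(rich)
    if n < 2:
        return None  # undefined for fewer than two rich nodes
    intra = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * intra / (n * (n - 1))

# A 4-contributor clique plus one weakly connected contributor.
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("a", "e")]
phi = rich_club(edges, 1)
```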
Credit scoring has been used to categorize customers based on various characteristics to evaluate their creditworthiness. Increasingly, machine learning techniques are being deployed for customer segmentation, classification, and scoring. In this talk, we will discuss various machine learning techniques that can be used for credit risk applications. Through a case study built in R, we will illustrate the nuances of working with practical data sets that include categorical and numerical data, different techniques that can be used to evaluate and explore customer profiles, visualizing high-dimensional data sets, and machine learning techniques for customer segmentation.
EKAW2016 - Interest Representation, Enrichment, Dynamics, and Propagation: A ... - Guangyuan Piao
Microblogging services such as Twitter have been widely adopted due to the highly social nature of the interactions they facilitate. With the rich information generated by users on these services, user modeling aims to acquire knowledge about a user's interests, which is a fundamental step towards personalization as well as recommendations. To this end, researchers have explored different dimensions such as (1) Interest Representation, (2) Content Enrichment, (3) Temporal Dynamics of user interests, and (4) Interest Propagation using semantic information from a knowledge base such as DBpedia. However, those dimensions of user modeling have largely been studied separately, and there is a lack of research on the synergetic effect of those dimensions for user modeling. In this paper, we address this research gap by investigating 16 different user modeling strategies produced by various combinations of those dimensions. The different user modeling strategies are evaluated in the context of a personalized link recommender system on Twitter. Results show that Interest Representation and Content Enrichment play crucial roles in user modeling, followed by Temporal Dynamics. The user modeling strategy considering Interest Representation, Content Enrichment, and Temporal Dynamics provides the best performance among the 16 strategies. On the other hand, Interest Propagation has little effect on user modeling in the case of leveraging a rich Interest Representation or considering Content Enrichment.
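One common way to realize the Temporal Dynamics dimension is exponential time decay of interest weights; the sketch below is a generic illustration with invented concepts and a hypothetical half-life parameter, not the paper's specific weighting scheme:

```python
def interest_profile(observations, now, half_life_days=30.0):
    """Aggregate (concept, day) observations into interest weights
    with exponential time decay: a mention loses half its weight
    every `half_life_days` days, so recent mentions count more."""
    weights = {}
    for concept, day in observations:
        age = now - day
        weights[concept] = weights.get(concept, 0.0) + \
            0.5 ** (age / half_life_days)
    return weights

# Hypothetical DBpedia-concept mentions on day 0 and day 60.
obs = [("dbpedia:Football", 0), ("dbpedia:Football", 60),
       ("dbpedia:Jazz", 60)]
profile = interest_profile(obs, now=60)
```

An old mention of Football contributes only 0.25 here, so Football (1.25) still outweighs Jazz (1.0), but by less than raw frequency would suggest.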
Hierarchical Transformers for User Semantic Similarity - ICWE 2023 - Marco Brambilla
We discuss the use of hierarchical transformers for user semantic similarity in the context of analyzing users' behavior and profiling social media users. The objectives of the research include finding the best model for computing semantic user similarity, exploring the use of transformer-based models, and evaluating whether the embeddings reflect the desired similarity concept and can be used for other tasks.
We use a large dataset of Twitter users and apply an automatic labeling approach. The dataset consists of English tweets posted in November and December 2020, totaling about 27GB of compressed data. Preprocessing steps include filtering out short texts, cleaning user connections, and selecting a benchmark set of users for evaluation.
Since Transformer architectures are designed for short texts, we cannot apply them directly to the extensive collections of tweets that describe a user's activity. Therefore, we propose a hierarchical structure of transformer models, applied first to individual tweets and then to their aggregations.
The models used in the study include hierarchical transformers, and the tweet embeddings are obtained using four Transformer-based models: RoBERTa, BERTweet, Sentence-BERT, and Twitter4SSE. The researchers test different techniques for processing tweet embeddings to generate accurate user embeddings, including mean pooling, recurrence over BERT (RoBERT), and transformer over BERT (ToBERT).
The evaluation of the models is done on a set of 5,000 users, comparing user similarities with 30 other candidate users, 5 of which are considered similar and 25 considered dissimilar. The evaluation metrics used include mean average precision (MAP), mean reciprocal rank (MRR) at 10, and normalized discounted cumulative gain (nDCG).
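The ranking metrics above are standard; as a minimal sketch (the candidate ranking below is invented, and real MRR/MAP average these per-user values over all 5,000 users):

```python
def reciprocal_rank(ranked, relevant, cutoff=10):
    """MRR@k component: inverse rank of the first relevant item
    within the cutoff, 0.0 if none appears."""
    for i, item in enumerate(ranked[:cutoff], start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Average precision of one ranking; the mean over users
    gives MAP."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# One user: 30 candidates, 5 of them similar; suppose the model
# ranks two similar users at positions 2 and 4 and misses the rest.
ranked = ["x0", "sim1", "x1", "sim2"] + [f"x{i}" for i in range(2, 28)]
relevant = {"sim1", "sim2", "sim3", "sim4", "sim5"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
```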
The optimization process involves selecting a loss function and using the AdamW optimizer with specific hyperparameters. The results show that the hierarchical approach with a Stage-1 Twitter4SSE model and a Stage-2 Transformer model performs the best among the alternatives.
In conclusion, the research provides a large unbiased dataset for user similarity analysis, presents a hierarchical language model optimized for accurate user similarity computation, and validates the models' performance on similarity tasks, with potential applications to related problems.
The future work includes investigating the impact of time and topic drift on the models' performance.
Exploring the Bi-verse. A trip across the digital and physical ecospheres - Marco Brambilla
The Web and social media are the environments where people post their content, opinions, activities, and resources. Therefore, a considerable amount of user-generated content is produced every day for a wide variety of purposes. On the other side, people live their everyday lives immersed in the physical world, where society, economy, politics, and personal relations continuously evolve. These two opposite and complementary environments are today fully integrated: they reflect each other and interact with each other ever more strongly.
Exploring and studying content and data coming from both environments offers a great opportunity to understand the ever-evolving modern society, in terms of topics of interest, events, relations, and behaviour.
In this speech I discuss, through business cases and socio-political scenarios, how we can extract insights and understand reality by combining and analyzing data from the digital and physical worlds, so as to reach a better overall picture of reality itself. Along this path, we need to take into account that reality is complex and varies in time, space, and along many other dimensions, including societal and economic variables. The speech highlights the main challenges that need to be addressed and outlines some data science strategies that can be applied to tackle them.
This slide deck has been presented as a keynote speech at WISE 2022 in Biarritz, France.
Trigger.eu: Cocteau game for policy making - introduction and demo - Marco Brambilla
COCTEAU stands for "Co-Creating the European Union".
It's a project supported by the European Union whose objective is to involve citizens in cooperating alongside policy makers, contributing to building a better future.
Generation of Realistic Navigation Paths for Web Site Testing using RNNs and ... - Marco Brambilla
Producing sensible and useful web log data requires a large audience of users and typically a long time frame, making it an expensive task.
To address this limit, we propose a method that focuses on the generation of realistic navigational paths, i.e., web logs.
Our approach is extremely relevant because it can at the same time tackle the lack of publicly available data about web navigation logs, and also be adopted in industry for the automatic generation of realistic test settings for Web sites yet to be deployed.
The generation has been implemented using deep learning methods for generating more realistic navigation activities, namely: Recurrent Neural Networks, which are very well suited to temporally evolving data; and Generative Adversarial Networks, neural networks aimed at generating new data, such as images or text, very similar to the originals and sometimes indistinguishable from them, which have become increasingly popular in recent years.
We ran experiments using open data sets of web logs for training, and we ran tests to assess the performance of the methods. Results in generating new web log data are quite good with respect to the two evaluation metrics adopted (BLEU and human evaluation).
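To make the task concrete, here is a deliberately simpler, non-neural baseline for the same problem (a first-order Markov chain over page transitions, not the RNN/GAN method of the paper; the log data is invented):

```python
import random
from collections import defaultdict

def train_transitions(logs):
    """Learn a first-order Markov model of page transitions from
    observed navigation paths."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in logs:
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    # Store each page's successors with their observed frequencies.
    return {cur: (list(nxts), list(nxts.values()))
            for cur, nxts in counts.items()}

def generate(model, start, max_len=6, seed=7):
    """Sample a synthetic navigation path from the model."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < max_len and path[-1] in model:
        pages, weights = model[path[-1]]
        path.append(rng.choices(pages, weights=weights)[0])
    return path

logs = [["home", "catalog", "item", "cart"],
        ["home", "search", "item", "cart"]]
model = train_transitions(logs)
path = generate(model, "home")
```

RNNs improve on this baseline mainly by conditioning each step on the whole history rather than on the previous page only.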
Our study is described in detail in the paper published at ICWE 2020 – International Conference on Web Engineering with DOI: 10.1007/978-3-030-50578-3. It’s available online on the Springer Web site.
Analysis of On-line Debate on Long-Running Political Phenomena. The Brexit C... - Marco Brambilla
In this study, we demonstrate that computational social science is important for understanding people's behavior in political phenomena; based on an analysis of the long-running Brexit debate on Twitter, we predict the public stance and discussion topics, and we measure the involvement of automated accounts and politicians' social media accounts.
Driving Style and Behavior Analysis based on Trip Segmentation over GPS Info... - Marco Brambilla
Over one billion cars interact with each other on the road every day. Each driver has their own driving style, which can impact safety, fuel economy, and road congestion. Knowledge about a driver's style could be used to encourage "better" driving behaviour through immediate feedback while driving, or by scaling auto insurance rates based on the aggressiveness of the driving style.
In this work we report on our study of driving behaviour profiling based on unsupervised data mining methods. The main goal is to detect different driving behaviours, and thus to cluster drivers with similar behaviour. This paves the way to new business models related to the driving sector, such as Pay-How-You-Drive insurance policies and car rentals.
Driver behavioural characteristics are studied by collecting information from GPS sensors on the cars and by applying three different analysis approaches (DP-means, Hidden Markov Models, and Behavioural Topic Extraction) to the contextual scene detection problem on car trips, in order to detect different behaviour along each trip. Subsequently, drivers are clustered into similar profiles, and the results are compared with a human-defined ground truth on driver classification. The proposed framework is tested on a real dataset containing sampled car signals. While the different approaches show relevant differences in trip segment classification, the coherence of the final driver clustering results is surprisingly high.
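DP-means, the first of the three approaches, extends k-means by spawning a new cluster whenever a point is farther than a penalty threshold from every existing centroid, so the number of behaviour clusters need not be fixed in advance. A one-dimensional sketch with invented signal values (the real method works on multidimensional GPS-derived features):

```python
def dp_means(points, lam, iters=5):
    """DP-means clustering (1-D sketch): like k-means, but a point
    farther than `lam` from all centroids spawns a new cluster."""
    centroids = [points[0]]
    assign = []
    for _ in range(iters):
        assign = []
        for p in points:
            dists = [abs(p - c) for c in centroids]
            j = min(range(len(centroids)), key=lambda i: dists[i])
            if dists[j] > lam:
                centroids.append(p)       # spawn a new cluster at p
                j = len(centroids) - 1
            assign.append(j)
        # Recompute each centroid as the mean of its assigned points.
        centroids = [sum(p for p, a in zip(points, assign) if a == k) /
                     max(1, sum(1 for a in assign if a == k))
                     for k in range(len(centroids))]
    return centroids, assign

# Hypothetical 1-D signal values from two driving styles.
signals = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
centroids, labels = dp_means(signals, lam=2.0)
```

With the penalty set between the within-cluster and between-cluster distances, the sketch recovers two clusters without being told k.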
Myths and challenges in knowledge extraction and analysis from human-generate...Marco Brambilla
For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, and messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted. The challenge is once again to capture and create knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice for the objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.
The talk explores the mixed realistic-utopian domain of knowledge extraction and reports on some tools and cases where the digital and physical worlds have been brought together for a better understanding of our society.
Harvesting Knowledge from Social Networks: Extracting Typed Relationships amo...Marco Brambilla
Knowledge bases like DBpedia, Yago or Google's Knowledge Graph contain huge amounts of ontological knowledge harvested from (semi-)structured, curated data sources, such as relational databases or XML and HTML documents. Yet, the Web is full of knowledge that is not curated and/or structured and, hence, not easily indexed, for example social data. Most work so far in this context has been dedicated to the extraction of entities, i.e., people, things or concepts. This poster describes our work toward the extraction of relationships among entities. The objective is reconstructing a typed graph of entities and relationships to represent the knowledge contained in social data, without the need for a-priori domain knowledge. The experiments with real datasets show promising performance across a variety of domains.
The key distinguishing feature of the work is its focus on highly unstructured social data (tweets and Facebook posts) without reliable grammar structures. Traditional relation extraction approaches, whether supervised, semi-supervised or unsupervised, commonly assume the availability of grammatically correct language corpora.
Model-driven Development of User Interfaces for IoT via Domain-specific Comp...Marco Brambilla
Internet of Things technologies and applications are evolving and continuously gaining traction in all fields and environments, including homes, cities, services, industry and commercial enterprises. However, many problems still need to be addressed. For instance, the IoT vision is mainly focused on the technological and infrastructure aspects, and on the management and analysis of the huge amount of generated data, while so far the development of front-ends and user interfaces for IoT has not played a relevant role in research. On the contrary, user interfaces in the IoT ecosystem can play a key role in the acceptance of solutions by final adopters. In this paper we present a model-driven approach to the design of IoT interfaces, by defining a specific visual design language and design patterns for IoT applications, and we show them at work. The language we propose is defined as an extension of the OMG standard language IFML.
A Model-Based Method for Seamless Web and Mobile Experience. Splash 2016 conf.Marco Brambilla
Consumer-centered software applications nowadays are required to be available both as mobile and desktop versions. However, the app design is frequently made only for one of the two (i.e., mobile first or web first) while missing an appropriate design for the other (which, in turn, simply mimics the interaction of the first one). This results in poor quality of the interaction on one or the other platform. Current solutions would require different designs, realized through different design methods and tools, which may double development and maintenance costs.
In order to mitigate this issue, this paper proposes a novel approach that supports the design of both web and mobile applications at once. Starting from a unique requirement and business specification, where web- and mobile-specific aspects are captured through tagging, we derive a platform-independent design of the system specified in IFML. This model is subsequently refined and detailed for the two platforms, and used to automatically generate both the web and mobile versions. If more precise interactions are needed for the mobile part, a blending with MobML, a mobile-specific modeling language, is devised. Full traceability of the relations between artifacts is granted.
The Web Science course focuses on the study of large-scale socio-technical systems associated with the World Wide Web. It considers the relationship between people and technology, the ways that society and technology complement one another and the way they impact on broader society. These analyses are inherently associated with Big Data management issues.
The course is organised in four parts.
1. Syntax
In the first part, the course introduces the basics of content analysis. It focuses on the syntactic aspects, covering the fundamentals of natural language processing and text mining. It describes the structure and typical characteristics of the different web sources, spanning search results, social media contents, social network structures, Web APIs, and so on. It also provides an overview of the basic Web analysis techniques applied in Web search and Web recommendation.
2. Semantics
In the second part, the course presents semantic technologies. These technologies are very important nowadays because they address the "variety" dimension of Big Data, i.e., they enable the integration of multiple and diverse sources of information, which is typical on the modern Web platform. Covered topics include:
- RDF - a flexible data model to represent heterogeneous data
- OWL - a flexible ontological language to model heterogeneous data sources
- SPARQL - a query language for RDF.
It shows how to put all the pieces together in order to achieve interoperability among heterogeneous information sources.
3. Time
The third part covers the realm of time-dependent data. The topics covered here address the "velocity" dimension of Big Data. It shows the importance, for many Big Data analysis scenarios, of processing data streams, coming for instance from Internet of Things (IoT) and Social Media sources; and it describes how to apply semantic and syntactic techniques in the context of time-dependent information. For instance, it shows how to extend RDF to model RDF streams, how to extend SPARQL to continuously process RDF streams, and how to reason on those RDF streams.
4. Applications
In the fourth part, the course focuses on specific application scenarios and presents the typical settings and problems where the presented techniques can be applied. This part discusses settings such as: big data analysis for smart cities; data analytics for brand monitoring (marketing) and event monitoring; data analysis for trend detection and user engagement; and so on.
On the Quest for Changing Knowledge. Capturing emerging entities from social ...Marco Brambilla
Massive data integration technologies have been recently used to produce very large ontologies. However, knowledge in the world continuously evolves, and ontologies are largely incomplete for what concerns low-frequency data, belonging to the so-called long tail.
Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately
reflects the relevant changes which hide emerging entities. Thus, we propose a method for
discovering emerging entities by extracting them from social content.
We start from a purely syntactic method as a baseline, and we propose a semantics-based method built on entity recognition and DBpedia. The method associates candidate entities with feature vectors, built from social content by using co-occurrence, and then extracts the emerging entities by using feature similarity measures.
Once instrumented by experts through very simple initialization, the method is capable of finding emerging entities and extracting their relevant relationships to given types; the method can be continuously or periodically iterated, using the already identified emerged knowledge as the new starting point.
We validate our method by applying it to a set of diverse domain-specific application scenarios, spanning fashion, literature, exhibitions and so on. We show the approach at work and we demonstrate its effectiveness on datasets with different characterization in terms of coverage, dynamics and size.
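The core ranking step described above (candidates represented as co-occurrence feature vectors, ranked by their distance from the centroid of the seed vectors) can be sketched as follows. The term co-occurrence vectors, entity names, and the choice of cosine similarity are illustrative assumptions, not the paper's actual data or metric.

```python
import numpy as np

def rank_candidates(seed_vecs, cand_vecs, names, top=3):
    """Rank candidate entities by cosine similarity of their
    co-occurrence vector to the centroid of the seed vectors."""
    centroid = np.mean(seed_vecs, axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(names, key=lambda n: -cos(cand_vecs[n], centroid))
    return scored[:top]

# hypothetical term co-occurrence vectors for seed entities
seeds = np.array([[1.0, 0.9, 0.0], [0.9, 1.0, 0.1]])
# hypothetical candidate entities and their vectors
cands = {"brandA": np.array([1.0, 1.0, 0.0]),   # close to the seeds
         "brandB": np.array([0.9, 0.8, 0.2]),
         "noise":  np.array([0.0, 0.1, 1.0])}   # unrelated content
print(rank_candidates(seeds, cands, list(cands), top=2))  # → ['brandA', 'brandB']
```

The iterative variant described in the paper would simply feed the top-ranked candidates back in as new seeds and recompute the centroid.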
Studying Multicultural Diversity of Cities and Neighborhoods through Social M...Marco Brambilla
Cities are growing as melting pots of people with different cultures, religions, and languages. In this paper, through multilingual analysis of Twitter content shared within a city, we analyze the prevalent language in the different neighborhoods of the city and we compare the results with census data, in order to highlight any parallelisms or discrepancies between the two data sources. We show that the officially identified neighborhoods actually represent significantly different communities and that the use of social media as a data source helps to detect those weak signals that are not captured by traditional data.
Model driven software engineering in practice book - Chapter 9 - Model to tex...Marco Brambilla
Slides for the mdse-book.com chapter 9 - Model-to-text transformations.
Complete set of slides now available:
Chapter 1 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-1-introduction
Chapter 2 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-2-mdse-principles
Chapter 3 - http://www.slideshare.net/jcabot/model-driven-software-engineering-in-practice-chapter-3-mdse-use-cases
Chapter 4 - http://www.slideshare.net/jcabot/modeldriven-software-engineering-in-practice-chapter-4
Chapter 5 - http://www.slideshare.net/mbrambil/modeldriven-software-engineering-in-practice-chapter-5-integration-of-modeldriven-in-development-processes
Chapter 6 - http://www.slideshare.net/jcabot/mdse-bookslideschapter6
Chapter 7 - http://www.slideshare.net/mbrambil/model-driven-software-engineering-in-practice-book-chapter-7-developing-your-own-modeling-language
Chapter 8 - http://www.slideshare.net/jcabot/modeldriven-software-engineering-in-practice-chapter-8-modeltomodel-transformations
Chapter 9 - https://www.slideshare.net/mbrambil/model-driven-software-engineering-in-practice-book-chapter-9-model-to-text-transformations-and-code-generation
Chapter 10 - http://www.slideshare.net/jcabot/mdse-bookslideschapter10managingmodels
This book discusses how approaches based on modeling can improve the daily practice of software professionals. This is known as Model-Driven Software Engineering (MDSE) or, simply, Model-Driven Engineering (MDE).
MDSE practices have proved to increase efficiency and effectiveness in software development. MDSE adoption in the software industry is foreseen to grow exponentially in the near future, e.g., due to the convergence of software development and business analysis.
This book is an agile and flexible tool to introduce you to the MDE and MDSE world, thus allowing you to quickly understand its basic principles and techniques and to choose the right set of MDE instruments for your needs so that you can start to benefit from MDE right away.
The book is organized into two main parts.
The first part discusses the foundations of MDSE in terms of basic concepts (i.e., models and transformations), driving principles, application scenarios and current standards, like the well-known MDA initiative proposed by OMG (Object Management Group), as well as the practices on how to integrate MDE in existing development processes.
The second part deals with the technical aspects of MDSE, spanning from the basics on when and how to build a domain-specific modeling language, to the description of Model-to-Text and Model-to-Model transformations, and the tools that support the management of MDE projects.
The book covers a wide set of introductory and technical topics, spanning MDE at large, definitions and orientation in the MD* world, metamodeling, domain specific languages, model transformations, reverse engineering, OMG's MDA, UML, OCL, ATL, QVT, MOF, Eclipse, EMF, GMF, TCS, xText.
Model driven software engineering in practice book - chapter 7 - Developing y...Marco Brambilla
Slides for the mdse-book.com - Chapter 7: Developing Your Own Modeling Language.
Automatic code generation for cross platform, multi-device mobile apps. An in...Marco Brambilla
This presentation was given at the MobileDeLi workshop 2015 collocated with the Splash 2015 conference.
With the continuously increasing adoption of mobile devices,
software development companies have new business opportunities
through direct sales in app stores and delivery of
business to employee (B2E) and business to business (B2B)
solutions. However, cross-platform and multi-device development
is a barrier for today's IT solution providers, especially
small and medium enterprises (SMEs), due to the high
cost and technical complexity of targeting development to a
wide spectrum of devices, which differ in format, interaction
paradigm, and software architecture. So far, several authors
have proposed the application of model driven approaches
to mobile apps development following a variety of strategies.
In this paper we present the results of a research study conducted
to find the best strategy for WebRatio, a software
development company, interested in producing an MDD tool
for designing and developing mobile apps to enter the mobile
apps market. We report on a comparative study conducted
to identify the best trade-offs between various automatic
code generation approaches.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables rank calculation in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
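A minimal sketch of the levelwise idea on a toy graph, assuming the SCC decomposition and block-graph are already given rather than computed: components are processed in topological order, and within each component the ranks are iterated while contributions from already-processed upstream components stay fixed. The graph, damping factor, and iteration count are illustrative, not the report's setup.

```python
from graphlib import TopologicalSorter  # stdlib topological ordering

# toy directed graph with no dead ends (every vertex has an out-edge)
edges = {0: [1], 1: [0, 2], 2: [3], 3: [2]}
N, d = 4, 0.85

# assume the SCC decomposition is given: component 0 = {0,1}, 1 = {2,3}
comps = [{0, 1}, {2, 3}]
block = {0: set(), 1: {0}}           # block-graph: component 1 depends on 0
order = list(TopologicalSorter(block).static_order())

rank = {v: 1.0 / N for v in edges}
for c in order:                      # one level (component) at a time
    for _ in range(50):              # iterate this component to convergence
        for v in comps[c]:
            # ranks outside the current component are already final
            incoming = sum(rank[u] / len(edges[u])
                           for u in edges if v in edges[u])
            rank[v] = (1 - d) / N + d * incoming
```

After processing, the ranks sum to 1 and match what a monolithic iteration over the whole graph would produce, which is the property that lets each level run without per-iteration communication.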
Community analysis using graph representation learning on social networks
1. Community Analysis
Using Graph Representation Learning
On Social Networks
Marco Brambilla and Mattia Gasparini
Politecnico di Milano
2. Introduction
• The development of platforms such as Instagram and Facebook has increased the level of interaction among people
• A variety of social network data can be exploited to map users' behavior
• Graphs are a perfect fit for modeling all the interactions of these users
3. Problem Statement
• Analysis of communities on on-line social networks, applying machine learning on graphs
• Representation learning is used to extract valuable information about users inside the community
• Classification of consumer and business users
• Grouping of similar users
4. Representation Learning
• Define a continuous representation (embedding) for each node of the graph, to easily apply machine learning techniques on graphs
• Embeddings are based on neighbourhood nodes
(figure: a node u and its neighbourhood)
5. Node2vec
• Embedding computations are performed using the node2vec algorithm [1], included in the Stanford Network Analysis Platform (SNAP) library
• The algorithm calculates the embeddings by solving an optimization problem:
max_f Σ_{u ∈ V} log Pr(N_S(u) | f(u))
[1] Grover and Leskovec. 2016. node2vec: Scalable Feature Learning for Networks.
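The distinctive ingredient of node2vec is the second-order biased random walk whose samples define the neighborhoods N_S(u) in the objective above; the walks are then fed to a skip-gram (word2vec-style) model to learn f. A minimal sketch of the walk generation, with a toy graph and illustrative p, q values:

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=0.5):
    """One second-order biased walk as in node2vec: returning to the
    previous node is weighted 1/p, staying at distance 1 from it is
    weighted 1, and moving farther away is weighted 1/q."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if len(walk) == 1:                   # first step: uniform choice
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)      # go back
            elif x in adj[prev]:
                weights.append(1.0)          # stays near the previous node
            else:
                weights.append(1.0 / q)      # explore outward
        walk.append(random.choices(nbrs, weights=weights)[0])
    return walk

# tiny undirected graph as adjacency lists
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
random.seed(7)
walks = [node2vec_walk(adj, v, length=5) for v in adj for _ in range(3)]
```

Low q (as here) biases the walks toward exploration, which approximates a DFS-like, community-oriented notion of neighborhood; high q approximates a BFS-like, structural one.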
7. Case Study
• Emerging Italian fashion brand: Emporio Le Sirenuse
• Products: luxury swimsuits and dresses
• The case study focuses on the brand, its competitors, and their communities, defined as the sets of their follower users on the social network
http://www.fashiondatasensing.polimi.it/
8. Related Work
• Users' communities defined using the graph's structural properties [himelboim2017, deeb2017, guerrero2017]
• Brand-related communities have a specific role, with business strategies as the final target [ramadan2018, kim2014, campbell2014]
• Fashion brands gain major advantages from social media [brambilla2017, schmidt2017]
10. 1 – Data Collection
• Web scraping of 10 brands and their followers' data from Instagram
• Time window: from 1st January 2017 to 1st November 2017
• Final database: 400K users, 10M posts
11. 2 – Graph Construction
• Graphs are built from several entities: the users that we want to analyze (U_t), their posts (P), the hashtags referenced in the posts (T) and the mentioned users (U_m)
• Symmetrically, three different types of edges are defined:
o E_owner = {(e1, e2) | e1 ∈ U_t, e2 ∈ P}
o E_tag = {(e1, e2) | e1 ∈ P, e2 ∈ T}
o E_mention = {(e1, e2) | e1 ∈ P, e2 ∈ U_m}
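On toy data, the entity and edge sets just defined could be materialized like this; the post records and field names are invented for illustration, not the actual scraped schema.

```python
# hypothetical scraped posts: owner, hashtags, mentioned users
posts = [
    {"id": "p1", "owner": "alice",
     "tags": ["#beach", "#fashion"], "mentions": ["brandX"]},
    {"id": "p2", "owner": "bob",
     "tags": ["#beach"], "mentions": []},
]

U_t = {p["owner"] for p in posts}                 # target users
P   = {p["id"] for p in posts}                    # posts
T   = {t for p in posts for t in p["tags"]}       # hashtags
U_m = {m for p in posts for m in p["mentions"]}   # mentioned users

# the three edge types, mirroring E_owner, E_tag, E_mention
E_owner   = {(p["owner"], p["id"]) for p in posts}
E_tag     = {(p["id"], t) for p in posts for t in p["tags"]}
E_mention = {(p["id"], m) for p in posts for m in p["mentions"]}
```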
12. 2 – Graph Construction
• Three graph models are used for the analysis:
1. Mixed network: G_M = ({U, P, T}, {E_owner, E_tag, E_mention})
2. Hashtags network: G_h = ({U_t, P, T}, {E_owner, E_tag})
3. Mentions network: G_m = ({U_t, U_m, P}, {E_owner, E_mention})
• G_h and G_m are subgraphs of G_M: they map the influence of specific social media aspects
13. Example Hashtags Network
The central part of the graph features the most connected nodes, which correspond to the users that have many hashtags in common.
14. 3 – Graph Reduction
• A reduction process is applied to G_h and G_m to obtain «classical» social networks, where the nodes are the users and the edges are weighted based on the number of shared entities:
w_ij = |t_i ∩ t_j| if i, j ∈ G_h
w_ij = |m_i ∩ m_j| if i, j ∈ G_m
where i, j ∈ U_t, t_i, t_j ⊆ T, m_i, m_j ⊆ U_m
• This generates G′_h and G′_m, the reduced hashtags and reduced mentions networks
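The reduction weights (number of hashtags shared by a pair of users) can be sketched like this; the users and hashtag sets are invented for illustration.

```python
from itertools import combinations

# hypothetical hashtag sets t_i for each target user i
tags = {"alice": {"#beach", "#fashion", "#italy"},
        "bob":   {"#beach", "#italy"},
        "carol": {"#code"}}

# reduced hashtags network: edge weight w_ij = |t_i ∩ t_j|
weights = {}
for i, j in combinations(sorted(tags), 2):
    w = len(tags[i] & tags[j])
    if w > 0:                 # only connect users who share something
        weights[(i, j)] = w
```

Here only alice and bob end up connected, with weight 2 (#beach and #italy); the mentions reduction is identical with mentioned-user sets in place of hashtag sets.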
16. 4 – Features Extraction
• Both the heterogeneous networks G_h, G_m and the reduced networks G′_h, G′_m are used to extract the embeddings
• The feature vector dimension is fixed for the two types of networks: d = 8 for the heterogeneous networks and d = 4 for the reduced ones
• Hyper-parameters p and q are tuned in a supervised setting
17. 5 – Classification
• Domain-specific task: «Discriminate between consumer and non-consumer users»
• Ground truth of 351 labelled users defined with domain experts
• Three feature sets are tested:
• Social media account data (#followers, #following, #posts, bio)
• Complete network embeddings
• Reduced network embeddings
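The slides do not fix a specific classifier for this task, so as an illustrative stand-in here is a nearest-centroid classifier over toy 2-d embeddings; the vectors, labels, and the classifier choice are assumptions for the sketch, not the study's actual setup.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Per-class mean of the embedding vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1)
                      for c in classes])
    return np.array([classes[k] for k in dists.argmin(axis=0)])

# toy embeddings: consumers near the origin, business accounts far away
X = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.0, 0.8]])
y = np.array(["consumer", "consumer", "business", "business"])
model = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(model, np.array([[0.05, 0.1], [0.95, 0.9]]))
```

Any off-the-shelf classifier would slot in the same way: the point of the experiment is comparing the three feature sets, not the classifier itself.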
18. 5 – Classification Experiment
The description of the user is valuable if a good fraction of the neighborhood is exploited, which is not always feasible for complete networks.
19. 5 – Classification Experiment on Reduced Networks
Performance and the number of classified users increase with the number of user nodes included in the model, even if they are not classified: they enrich the neighborhood and, as a consequence, the feature vectors.
20. 6 – Clustering
• The hashtags reduced network G′_h is used as a proxy for content-based similarity
• K-means is applied on the extracted feature vectors
• Focus on the G′_h of the Emporio Le Sirenuse community
21. 6 – Clustering Network Input
Hashtags reduced network G′_h of the Emporio Le Sirenuse community.
22. 6 – Clustering Features
Embeddings extracted from the network. The first two feature components are used for visualization.
23. 6 – Clustering Output
K selection: plot of inertia against the number of clusters.
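The K-selection step (plotting inertia against the number of clusters and looking for the elbow) can be sketched with a small, self-contained K-means; the farthest-point seeding and the toy embeddings are illustrative choices, not the original implementation.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic farthest-point seeding (a k-means++-style variant)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    return np.array(centers)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations; returns labels and the inertia
    (sum of squared distances to the assigned centers)."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, inertia

# toy 2-d embeddings: three well-separated groups of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

# elbow heuristic: inertia drops sharply until k reaches the group count
inertias = [kmeans(X, k)[1] for k in range(1, 6)]
```

On this toy data the inertia falls steeply from k=1 to k=3 and flattens afterwards, which is the elbow the slide's plot is used to find.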
25. 6 – Cluster Validation: Domain Experts
• Domain experts are provided with a subset of users for each cluster
• Manual inspection of user profiles, providing feedback about the patterns present in each cluster
26. 6 – Cluster Validation: Experts Feedback
• Clusters 0, 1 and 2 are very well defined: professional users, such as showrooms and other brands
• Cluster 3 contains regular users that share content about holidays in Italy
• Clusters 3, 4, 5 and 6 are composed mostly of regular users, too
27. 6 – Cluster Labels
Cluster labels are extracted using the set of hashtags shared by at least two users inside each cluster.
29. Conclusion
• Results:
• Definition of an effective method to analyze communities in the social network domain
• Modeling of user similarities through network features
• Detection of content-driven sub-communities
• Future work:
• Inclusion of the time variable
Good morning, today I am going to present our research work about community analysis using graph representation learning on social networks.
The starting point is that modern social networks such as Instagram and Facebook have exponentially increased the number of interactions among people. That variety of data can be exploited to map user behavior, and the data itself fits perfectly into a graph model that captures user interactions.
Our purpose is to analyze communities on on-line social networks, applying innovative machine learning techniques on graphs.
Specifically, we want to apply representation learning on graphs to describe users inside communities: two main tasks have been developed, one that classifies users into consumers and non-consumers, the other that extracts subgroups of similar users.
Just a brief mention of the technique: representation learning defines a continuous feature vector for each node of the graph, referred to as an embedding. The embeddings are learnt with different strategies: as one possible example, focusing on a specific node u, we can exploit its local neighbors, the blue nodes in the picture, to learn the feature vector of u.
Many algorithms are able to perform this operation, and we chose node2vec, which provides a very flexible technique. It computes the embedding f(u) of a node u using the following equation.
It maximizes the log-probability of observing a network neighborhood N_S(u) for a node u conditioned on its feature representation, given by f.
We can see how it works here: the intuition is that nodes that are near in the graph are also near in the vector space.
The scenario that we take as a case study concerns an emerging Italian fashion brand, Emporio Le Sirenuse: the brand is located in Positano, near Naples, and it mainly produces women's luxury swimsuits and dresses.
The work focuses on the community of the brand, defined as the set of its Instagram followers: the idea is that brands can get valuable insights about the specific interests of their followers and, in this way, better target their products and marketing campaigns.
1st group: community detection on social networks is a quite well-known domain, but the network structure is not really exploited.
2nd group: analysis of users' reactions to brand network marketing, as well as content sharing as an indicator of brand trust and community commitment.
3rd group: Instagram is a visual social network with high potential for fashion brands, whose main feature lies precisely in visual aspects.
We defined an analysis pipeline and now I will go into the details of each step.
As step 1, data is gathered using web scraping to collect posts of the brands and their followers from Instagram, in a time window that spans from January 2017 to November 2017. We collected around 400K users and 10M posts.
Second step is the definition of the graph model.
We consider as entities users, posts and hashtags, and then we define three sets of edges: one connects users to the posts they produced, while the other two connect posts to the referenced entities, hashtags and mentions.
The heterogeneous graph G_M contains all the entities and relationships. From this graph, two subgraphs are extracted: the hashtags network and the mentions network, which map two important aspects of social media interactions.
As an example, this is a hashtags graph built from the gathered data: green nodes are the users (the ones with more connections), blue nodes are posts, and hashtags are in orange. The important fact is that users that have many hashtags in common are concentrated in the centre of the network, so they are «near».
In a further step, a graph reduction is applied to the previously presented graphs to obtain homogeneous networks, where only user nodes are present.
The reduced graphs are weighted as well, where the weight is based on the number of common entities, either hashtags or mentioned users.
In this way, the reduced networks G′_h and G′_m are generated.
In this example, you can see a reduced mentions network: edges connect each user to the ones that he or she mentioned, and the number of mentions, the weight, is mapped to a color, from low (blue) to high (red).
Embeddings are extracted both for the heterogeneous networks and for the reduced networks. The number of dimensions of the output vectors is fixed a priori: it is set to 8 for the heterogeneous networks (which are bigger) and to 4 for the reduced networks.
The main parameters of the algorithm, p and q, are instead selected via hyper-parameter tuning.
The classification step is designed to prove the effectiveness of the features in our domain: we want to discriminate between consumer and non-consumer users on Instagram. To do so, we manually labelled a set of users with the help of domain experts from the Politecnico di Milano fashion department.
Then, a classifier is implemented to test three sets of features: social media quantitative features are used as a baseline, compared against the features extracted from the complete and reduced networks.
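The shape of this comparison can be sketched with scikit-learn. The data below is synthetic (the real features and labels are not public), and the classifier choice is an assumption; only the evaluation structure mirrors the experiment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)  # synthetic consumer (1) vs non-consumer (0) labels

# Synthetic stand-ins for the three feature sets; only the class
# separability differs, to mimic features of different quality
feature_sets = {
    "quantitative baseline":        rng.normal(size=(n, 3)) + 0.3 * y[:, None],
    "complete-network embeddings":  rng.normal(size=(n, 8)) + 0.6 * y[:, None],
    "reduced-network embeddings":   rng.normal(size=(n, 4)) + 1.2 * y[:, None],
}

# Same classifier, same cross-validation, one run per feature set
scores = {name: cross_val_score(LogisticRegression(), X, y, cv=5).mean()
          for name, X in feature_sets.items()}
```

Each feature set gets an accuracy score under identical conditions, which is what makes the comparison in the table meaningful.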
The results of the first experiment are shown in this table: reduced-network features perform better than complete-network ones, and better than the social media baseline too.
This is because, given a fixed computational budget, reduced networks are smaller, so the neighborhood of each node is easier to exploit.
At the same time, they are still able to encode the main dynamics useful for our purpose.
Given first experiment results, we performed a second experiment on reduced networks only.
In this experiment, the ground-truth network is enlarged with a set of additional non-labelled users taken from the followers of different brands: results show that the more users are included, the richer the neighborhood of the labelled users becomes, and so the performance increases.
As the second task, we want to exploit the features to extract new subgroups of users from the community of the brand, defined as the set of its followers. The focus is on the community of Emporio Sirenuse, using the reduced hashtags graph as a proxy of content description.
This is the actual reduced network over which we run our analysis: each node is a user, and edges connect pairs of users that shared the same hashtags.
We extract the embeddings of this graph: a 2-d visualization, using the first two components, is presented.
We use a standard parametrization of the algorithm (p=1, q=0.5), which favors exploration of the local neighborhood.
We run K-means over this set of features: K is selected using inertia as structural validation metric.
These are the 7 clusters obtained, as well as the plot of inertia with respect to K.
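The K selection step can be sketched as follows; the blob data stands in for the real embedding vectors, and the range of K is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for user embeddings: three well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 2))
               for c in [(0, 0), (6, 0), (0, 6)]])

# Inertia (within-cluster sum of squares) for a range of K: the "elbow",
# where the curve flattens, suggests the K to keep
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 8)}
```

Inertia always decreases as K grows, so it is the *shape* of the curve, not its minimum, that drives the choice.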
The output network is presented, with colors associated with clusters.
Clustering needs external measures to validate the results: for this reason, we provided domain experts with a subset of users for each cluster.
They manually inspected the social media profile of each user, providing feedback about the presence of patterns inside the clusters.
The lists are ordered by distance from the centroid, which is used as a similarity quality measure.
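The ordering amounts to a few lines of numpy; the function name and the toy vectors below are illustrative:

```python
import numpy as np

def rank_by_centroid_distance(embeddings, member_ids):
    """Order one cluster's users by Euclidean distance from the cluster
    centroid: the closest (most representative) users come first."""
    pts = embeddings[member_ids]
    centroid = pts.mean(axis=0)
    dist = np.linalg.norm(pts - centroid, axis=1)
    return [member_ids[i] for i in np.argsort(dist)]

# Toy 2-d embeddings: user 2 is far from the other two
emb = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
ranked = rank_by_centroid_distance(emb, [0, 1, 2])
```

Sampling from the top of this list hands the experts the most "typical" members of each cluster first.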
The insights are simple but quite interesting:
Clusters 0, 1 and 2 contain users that share very specific content, such as interior design or food: they are mainly professional profiles, such as showrooms or brands.
Cluster 3 is very well defined too, but it contains regular users: they share content about holidays in Italy, which matches the brand identity.
Clusters 4, 5 and 6 contain regular users with broader content.
As additional validation, we provide a way to label each cluster: we compute the list of cluster hashtags T(c) as the set of hashtags shared by at least two users inside the cluster (i.e., the hashtags that increase the weights and/or connections inside the cluster).
Then, the label is defined as the top 10 hashtags by frequency belonging to this list. These lists are presented in the table, showing a labeling consistent with the previous validation (e.g., cluster 3 uses hashtags related to Italian vacations, cluster 0 to luxury accessories, cluster 1 to food, ...).
T(c) = ⋃_{u,v ∈ c} (t(u) ∩ t(v))
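This definition translates directly into code; the function name and the sample hashtag sets are illustrative:

```python
from itertools import combinations
from collections import Counter

def cluster_label(cluster_users, user_tags, top=10):
    """T(c): the union over user pairs (u, v) in the cluster of
    t(u) ∩ t(v); the label is the `top` most frequent hashtags in T(c)."""
    shared = set()
    for u, v in combinations(cluster_users, 2):
        shared |= user_tags[u] & user_tags[v]
    freq = Counter(t for u in cluster_users
                   for t in user_tags[u] if t in shared)
    return [t for t, _ in freq.most_common(top)]

# Toy cluster: hashtags used by only one user never reach the label
tags = {"u1": {"#positano", "#summer"},
        "u2": {"#positano", "#italy"},
        "u3": {"#positano", "#italy", "#solo"}}
label = cluster_label(["u1", "u2", "u3"], tags, top=2)
```

Requiring at least two users per hashtag filters out idiosyncratic tags, so the label reflects what actually connects the cluster.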
The final result of clustering is a segmentation of users that the brand can use to better target its marketing campaigns, or to set up other collaborations (e.g., the luxury (0), food (1), and interior design (2) clusters are professionals).
As a final conclusion, in this work we defined an effective method to characterize users inside online communities: users are described using features extracted from their network representation, and we are able to use these features to solve domain-specific classification tasks, as well as to identify subgroups of users based on shared interests.
In this analysis, the time variable is missing and the graphs are built from a single snapshot of all the data: time-varying graphs could potentially capture more fine-grained patterns.