The document outlines a presentation on analyzing social network data from Facebook and VKontakte. It discusses gathering and storing social network data, performing social network analysis, and applying machine learning techniques to detect users' gender, language, and interests from their profile data and network connections. Specific methods are described for gender classification using names, language detection in Facebook profiles, and identifying 253 interest categories that users may belong to based on the groups they join.
A sentiment index measures the average emotional level in a corpus. We introduce four such indexes and use them to gauge average “positiveness” of a population during some period based on posts in a social network. This article for the first time presents a text-, rather than word-based sentiment index. Furthermore, this study presents the first large-scale study of the sentiment index of the Russian-speaking Facebook. Our results are consistent with the prior experiments for English language.
Detecting Gender by Full Name: Experiments with the Russian LanguageAlexander Panchenko
This paper describes a method that detects gender of a person by his/her full name. While some approaches were proposed for English language, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character $n$-grams and dictionary of names) combined within a linear supervised model. Experiments show that the proposed simple and computationally efficient approach yields excellent results achieving accuracy up to 96\%.
World Education Services (WES) prepared a credential evaluation report for Inna Mykolayivna Zhuravel summarizing her Ukrainian educational credentials. The report indicates that Zhuravel earned a Bachelor's degree in 2007 from Cherkassy National University in Foreign Language and Literature (English) and a Master's degree in 2009 also from Cherkassy National University in English and German Language and Literature Education. WES authenticated Zhuravel's transcripts and determined that her degrees were equivalent to a Bachelor's and Master's degree from a regionally accredited U.S. institution. The report was sent to Zhuravel and the Minnesota Department of Education as requested.
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell-checking, tagging, and other software in the field of natural language processing,
loan words are not easily recognized and should be evaluated
separately. In this paper we present a corpora-based approach to the automatic detection of anglicisms in Russian social network
texts. Proposed method is based on the idea of simultaneous
scripting, phonetics, and semantics similarity of the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcribing, and morphological analysis methods to find possible hypotheses and distributional semantic models to filter them. Resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good intersection with manually collected dictionary. Proposed method is fully automated and can be applied to any domain–specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...AIST
The document discusses an ontological approach for educating programmers using semantic web technologies. The authors developed an ontological model with 128 nodes and part-of relations. They also generated test questions using techniques like homonyms, quasi-synonyms, paronyms, and antonyms to differentiate meanings based on context. The authors thank the audience for their attention and provide their contact information.
Optimizing Search User Interfaces and Interactions within Professional Social...Nik Spirin
The document discusses optimizing search user interfaces (SUIs) and interactions within professional social networks. It analyzes search behavior data from Facebook to answer four research questions. Key findings include: (1) Users use named entity queries (NEQs) more for friends and structured queries (SQs) more for non-friends, complementing each other; (2) Younger users search more non-friends while older users search more friends; (3) Females write more queries than males across query types; (4) Number of friends correlates with more friend NEQs but less non-friend SQs. The study provides implications for personalizing search suggestions across demographics.
This document summarizes reading proficiency data from the National Assessment of Educational Progress for fourth graders in Charlotte-Mecklenburg Schools from 2003-2013. It finds that while reading proficiency increased from 31% to 40% over this period, there are persistent gaps in proficiency between boys and girls, different racial groups, and students from lower-income families compared to higher-income families. Closing these gaps is a goal of the community-wide Read Charlotte initiative to increase the percentage of third graders reading proficiently to 80% by 2025.
In these slides, the overview of the RusProfiling shared task at PAN@FIRE 2017 in Bangalore, India.
This year task aimed at gender identification in Russian texts in a cross-genre perspective: training on Twitter, evaluating on Twitter, Facebook, reviews, essays and gender-imitated texts.
A sentiment index measures the average emotional level in a corpus. We introduce four such indexes and use them to gauge average “positiveness” of a population during some period based on posts in a social network. This article for the first time presents a text-, rather than word-based sentiment index. Furthermore, this study presents the first large-scale study of the sentiment index of the Russian-speaking Facebook. Our results are consistent with the prior experiments for English language.
Detecting Gender by Full Name: Experiments with the Russian LanguageAlexander Panchenko
This paper describes a method that detects gender of a person by his/her full name. While some approaches were proposed for English language, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character $n$-grams and dictionary of names) combined within a linear supervised model. Experiments show that the proposed simple and computationally efficient approach yields excellent results achieving accuracy up to 96\%.
World Education Services (WES) prepared a credential evaluation report for Inna Mykolayivna Zhuravel summarizing her Ukrainian educational credentials. The report indicates that Zhuravel earned a Bachelor's degree in 2007 from Cherkassy National University in Foreign Language and Literature (English) and a Master's degree in 2009 also from Cherkassy National University in English and German Language and Literature Education. WES authenticated Zhuravel's transcripts and determined that her degrees were equivalent to a Bachelor's and Master's degree from a regionally accredited U.S. institution. The report was sent to Zhuravel and the Minnesota Department of Education as requested.
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell-checking, tagging, and other software in the field of natural language processing,
loan words are not easily recognized and should be evaluated
separately. In this paper we present a corpora-based approach to the automatic detection of anglicisms in Russian social network
texts. Proposed method is based on the idea of simultaneous
scripting, phonetics, and semantics similarity of the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcribing, and morphological analysis methods to find possible hypotheses and distributional semantic models to filter them. Resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good intersection with manually collected dictionary. Proposed method is fully automated and can be applied to any domain–specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
Sergey Zaika and Andrew Toporkov - Semantic Web on Duty of E- Learning: Ontol...AIST
The document discusses an ontological approach for educating programmers using semantic web technologies. The authors developed an ontological model with 128 nodes and part-of relations. They also generated test questions using techniques like homonyms, quasi-synonyms, paronyms, and antonyms to differentiate meanings based on context. The authors thank the audience for their attention and provide their contact information.
Optimizing Search User Interfaces and Interactions within Professional Social...Nik Spirin
The document discusses optimizing search user interfaces (SUIs) and interactions within professional social networks. It analyzes search behavior data from Facebook to answer four research questions. Key findings include: (1) Users use named entity queries (NEQs) more for friends and structured queries (SQs) more for non-friends, complementing each other; (2) Younger users search more non-friends while older users search more friends; (3) Females write more queries than males across query types; (4) Number of friends correlates with more friend NEQs but less non-friend SQs. The study provides implications for personalizing search suggestions across demographics.
This document summarizes reading proficiency data from the National Assessment of Educational Progress for fourth graders in Charlotte-Mecklenburg Schools from 2003-2013. It finds that while reading proficiency increased from 31% to 40% over this period, there are persistent gaps in proficiency between boys and girls, different racial groups, and students from lower-income families compared to higher-income families. Closing these gaps is a goal of the community-wide Read Charlotte initiative to increase the percentage of third graders reading proficiently to 80% by 2025.
In these slides, the overview of the RusProfiling shared task at PAN@FIRE 2017 in Bangalore, India.
This year task aimed at gender identification in Russian texts in a cross-genre perspective: training on Twitter, evaluating on Twitter, Facebook, reviews, essays and gender-imitated texts.
This document describes a graph-based approach for skill extraction from text. It discusses expertise retrieval and previous related work. It then describes the Elisit system, which uses Wikipedia pages and spreading activation on Wikipedia's hyperlink network to associate skills with input documents. Sample queries to the system are provided. The method associates documents with Wikipedia pages and then performs spreading activation to find related skills. Evaluation shows biasing the spreading activation improves results.
This document discusses semantic similarity measures and hybrid measures for semantic relation extraction. It summarizes a presentation given by Alexander Panchenko on similarity measures. The presentation covers pattern-based measures, comparisons of different measures, hybrid measures that combine multiple single measures, and applications of semantic similarity measures like a lexico-semantic search engine and file categorization system. Evaluation shows that supervised hybrid measures like Logit outperform single measures based on precision-recall.
Вычислительная лексическая семантика: метрики семантической близости и их при...Alexander Panchenko
Вычислительная лексическая семантика: метрики семантической близости и их приложения
Серия лекций в НИУ ВШЭ, факультет бизнес-информатики и прикладной математики (Нижний Новгород)
Social Networks @ Scale discusses how to analyze large social networks using graph databases and data science tools. It explains that social network analysis has a long history in sociology and can help solve problems by examining relationships. The document also discusses how Cohort uses social network data and analytics to understand relationships between people. It describes challenges of scaling to billions of connections and messages, and solutions like using Redis, Spark and Kafka to enable real-time processing and analytics of streaming social network data at large scales.
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Alexander Panchenko
This document analyzes formal and informal relationships between Russian-speaking Facebook users based on a dataset of 3.3 million users. It finds that:
- Commenting links (informal relationships) are much less common than friendship links (formal relationships).
- A user has around a 50% chance of being friends with another user they comment on, and a 67% chance if they have a strong commenting relationship.
- However, if two users are friends, there is only a 4% chance they will comment on each other's content, and a 16.5% chance if they are strong friends.
- Informal relationships thus seem to be conditioned by, but not determined by, formal friendships
2015 pdf-marc smith-node xl-social media snaMarc Smith
This document provides an introduction to social network analysis using the NodeXL tool. It discusses how NodeXL can be used to map social media networks from platforms like Twitter. Different types of social media networks are identified, including polarized crowds, tight crowds, brand clusters, community clusters, broadcast networks, and support networks. Examples of social network scholarship are presented. Strategies for using social network analysis to improve engagement on social media are also discussed.
The science of networks is becoming an increasingly important and intriguing area of study that reveals many a patterns and relationships often hidden. This presentation is about the use of SNA to study the network of the Digital Library Community
Semantic Similarity Measures for Semantic Relation ExtractionAlexander Panchenko
The document describes a study on semantic similarity measures for semantic relation extraction. It discusses pattern-based similarity measures that extract semantic relations from lexico-syntactic patterns in corpora. It also covers hybrid semantic similarity measures that combine multiple single similarity measures as features. The document reports on the performance of the pattern-based and hybrid measures in correlating with human judgments of semantic similarity and in ranking semantic relations.
Social Network Analysis for Competitive IntelligenceAugust Jackson
How can CI teams apply the concepts of social network analysis to gain insight into the capabilities and plans of their competitors? Presented by Jim Richardson and August Jackson in April 2007 at the Society of Competitive Intelligence Professionals annual conference in New York City.
Tools for SNA
Page Rank Algorithm
Hierarchical Clustering
Recommendation System based on SNA (Collaborative Filtering)
How Facebook/Amazon uses SNA for recommendations?
Two Hop degree
Dynamism in Friendship Network of CSE-B
Online Social Networks and Clusters
Influential Nodes and Their Importance
Bibliography
Question - Answer Session
Using Social Network Analysis to Assess Organizational Development InitiativesStephanie Richter
Presented at 2016 POD Network conference #POD16
Many Faculty Development centers engage in far-reaching organizational development initiatives within their institutions. These initiatives are incredibly valuable but difficult to assess using traditional methods. Social network analysis (SNA) is a powerful visualization and statistical technique that has multiple applications in researching and assessing organizational development. In this session, learn how SNA was used at one institution to investigate the formation of community regarding online course quality standards as well as to analyze organizational structure for strategic planning. While this session focuses on organizational uses, examples will also be shared of applications for teaching and research.
Social Network Analysis (SNA) and its implications for knowledge discovery in...ACMBangalore
Social Network Analysis (SNA) and its implications for knowledge discovery in Informal Networks- Talk by Dr Jai Ganesh, SETLabs, Infosys at Search and Social Platforms tutorial, as part of Compute 2009, ACM Bangalore
This document discusses gender and age classification using facial features. It explains that while humans can easily determine gender and estimate age from faces, it remains challenging for computers. The document outlines an approach that uses a convolutional neural network (CNN) trained on facial images and metadata scraped from LinkedIn profiles. Baseline classifiers using logistic regression are compared to the CNN, showing the CNN achieves 81% accuracy for gender classification and 68% for age estimation. Future work to improve accuracy is proposed, such as collecting more training data.
Text analytics in Python and R with examples from Tobacco ControlBen Healey
This document discusses text analytics techniques for summarizing and analyzing unstructured text documents, with examples from analyzing documents related to tobacco control. It covers data cleaning and standardization steps like removing punctuation, stopwords, stemming, and deduplication. It also discusses frequency analysis using document-term matrices, topic modeling using LDA, and unsupervised and supervised classification techniques. The document provides examples analyzing posts from new users versus highly active users on an online forum, identifying topics and comparing topic distributions between different user groups.
The steps are as follows:
Step 1: installing required libraries to R environment
Step 2: Generating Token Key from Facebook
Step 3: Retrieving Information from Facebook
Step 4: Advance Graphical Representation
TN3270 Access to Mainframe SNA ApplicationszOSCommserver
This presentation presents the basics of the TN3270 protocol, and discusses how to configure the TN3270 server provided with z/OS Communications Server.
Three different classifiers for facial age estimation based on K-nearest neig...Alaa Tharwat
Abstract - The exact age estimation is often treated as a
classification problem; while it can be formulated as a
regression problem. In this article, three different classifiers based
on KNN classifier's concept for facial age estimation were
designed and developed to achieve high efficiency calculation of
facial age estimation. In the first classifier, we adopt KNN-distance
approach to calculate minimum distance between test face
image and all instances belong to the class that has the highest
number of nearest samples. Additionally, in the second
classifier a modified-KNN version was proposed and the
classifier scoring results interpolated to calculate the exact age
estimation. Furthermore, KNN-regression classifier as third
classifier that used to combine the classification and regression
approaches to improve the accuracy of the age estimation
system. Moreover, we compared age estimation errors under
two situations: case 1, age estimation is performed without
discrimination between males and females (gender unknown);
and case 2, age estimation is performed for males and females
separately (gender known). Results of experiments conducted
on well know benchmark FG-NET Database show the
effectiveness of the proposed approach.
The document discusses improvements made to the VIVO search engine between versions 1.2.1 and 1.3. Version 1.3 transitioned from Lucene to SOLR, allowing it to index additional data from semantic relationships and interconnectivity. This enriched the search index and provided more relevant results with better ranking compared to version 1.2.1. Indexing was also improved through multithreading, reducing indexing time. Experiments were discussed to further enhance search quality through techniques like query expansion, spelling correction, and using ontologies for fact-based questioning.
The document summarizes experiences building a semantic web application to detect conflicts of interest using FOAF and DBLP data. It involved multiple steps: obtaining and preparing data; representing entities and relationships in an ontology; querying the data using semantic associations to determine COI levels; visualizing results; and evaluating based on a conference review dataset. The system was able to detect indirect COI relationships that syntactic matching would miss.
This document describes a graph-based approach for skill extraction from text. It discusses expertise retrieval and previous related work. It then describes the Elisit system, which uses Wikipedia pages and spreading activation on Wikipedia's hyperlink network to associate skills with input documents. Sample queries to the system are provided. The method associates documents with Wikipedia pages and then performs spreading activation to find related skills. Evaluation shows biasing the spreading activation improves results.
This document discusses semantic similarity measures and hybrid measures for semantic relation extraction. It summarizes a presentation given by Alexander Panchenko on similarity measures. The presentation covers pattern-based measures, comparisons of different measures, hybrid measures that combine multiple single measures, and applications of semantic similarity measures like a lexico-semantic search engine and file categorization system. Evaluation shows that supervised hybrid measures like Logit outperform single measures based on precision-recall.
Вычислительная лексическая семантика: метрики семантической близости и их при...Alexander Panchenko
Вычислительная лексическая семантика: метрики семантической близости и их приложения
Серия лекций в НИУ ВШЭ, факультет бизнес-информатики и прикладной математики (Нижний Новгород)
Social Networks @ Scale discusses how to analyze large social networks using graph databases and data science tools. It explains that social network analysis has a long history in sociology and can help solve problems by examining relationships. The document also discusses how Cohort uses social network data and analytics to understand relationships between people. It describes challenges of scaling to billions of connections and messages, and solutions like using Redis, Spark and Kafka to enable real-time processing and analytics of streaming social network data at large scales.
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Alexander Panchenko
This document analyzes formal and informal relationships between Russian-speaking Facebook users based on a dataset of 3.3 million users. It finds that:
- Commenting links (informal relationships) are much less common than friendship links (formal relationships).
- A user has around a 50% chance of being friends with another user they comment on, and a 67% chance if they have a strong commenting relationship.
- However, if two users are friends, there is only a 4% chance they will comment on each other's content, and a 16.5% chance if they are strong friends.
- Informal relationships thus seem to be conditioned by, but not determined by, formal friendships
2015 pdf-marc smith-node xl-social media snaMarc Smith
This document provides an introduction to social network analysis using the NodeXL tool. It discusses how NodeXL can be used to map social media networks from platforms like Twitter. Different types of social media networks are identified, including polarized crowds, tight crowds, brand clusters, community clusters, broadcast networks, and support networks. Examples of social network scholarship are presented. Strategies for using social network analysis to improve engagement on social media are also discussed.
The science of networks is becoming an increasingly important and intriguing area of study that reveals many a patterns and relationships often hidden. This presentation is about the use of SNA to study the network of the Digital Library Community
Semantic Similarity Measures for Semantic Relation ExtractionAlexander Panchenko
The document describes a study on semantic similarity measures for semantic relation extraction. It discusses pattern-based similarity measures that extract semantic relations from lexico-syntactic patterns in corpora. It also covers hybrid semantic similarity measures that combine multiple single similarity measures as features. The document reports on the performance of the pattern-based and hybrid measures in correlating with human judgments of semantic similarity and in ranking semantic relations.
Social Network Analysis for Competitive IntelligenceAugust Jackson
How can CI teams apply the concepts of social network analysis to gain insight into the capabilities and plans of their competitors? Presented by Jim Richardson and August Jackson in April 2007 at the Society of Competitive Intelligence Professionals annual conference in New York City.
Tools for SNA
Page Rank Algorithm
Hierarchical Clustering
Recommendation System based on SNA (Collaborative Filtering)
How Facebook/Amazon uses SNA for recommendations?
Two Hop degree
Dynamism in Friendship Network of CSE-B
Online Social Networks and Clusters
Influential Nodes and Their Importance
Bibliography
Question - Answer Session
Using Social Network Analysis to Assess Organizational Development InitiativesStephanie Richter
Presented at 2016 POD Network conference #POD16
Many Faculty Development centers engage in far-reaching organizational development initiatives within their institutions. These initiatives are incredibly valuable but difficult to assess using traditional methods. Social network analysis (SNA) is a powerful visualization and statistical technique that has multiple applications in researching and assessing organizational development. In this session, learn how SNA was used at one institution to investigate the formation of community regarding online course quality standards as well as to analyze organizational structure for strategic planning. While this session focuses on organizational uses, examples will also be shared of applications for teaching and research.
Social Network Analysis (SNA) and its implications for knowledge discovery in...ACMBangalore
Social Network Analysis (SNA) and its implications for knowledge discovery in Informal Networks- Talk by Dr Jai Ganesh, SETLabs, Infosys at Search and Social Platforms tutorial, as part of Compute 2009, ACM Bangalore
This document discusses gender and age classification using facial features. It explains that while humans can easily determine gender and estimate age from faces, it remains challenging for computers. The document outlines an approach that uses a convolutional neural network (CNN) trained on facial images and metadata scraped from LinkedIn profiles. Baseline classifiers using logistic regression are compared to the CNN, showing the CNN achieves 81% accuracy for gender classification and 68% for age estimation. Future work to improve accuracy is proposed, such as collecting more training data.
Text analytics in Python and R with examples from Tobacco ControlBen Healey
This document discusses text analytics techniques for summarizing and analyzing unstructured text documents, with examples from analyzing documents related to tobacco control. It covers data cleaning and standardization steps like removing punctuation, stopwords, stemming, and deduplication. It also discusses frequency analysis using document-term matrices, topic modeling using LDA, and unsupervised and supervised classification techniques. The document provides examples analyzing posts from new users versus highly active users on an online forum, identifying topics and comparing topic distributions between different user groups.
The steps are as follows:
Step 1: installing required libraries to R environment
Step 2: Generating Token Key from Facebook
Step 3: Retrieving Information from Facebook
Step 4: Advance Graphical Representation
TN3270 Access to Mainframe SNA ApplicationszOSCommserver
This presentation presents the basics of the TN3270 protocol, and discusses how to configure the TN3270 server provided with z/OS Communications Server.
Three different classifiers for facial age estimation based on K-nearest neig...Alaa Tharwat
Abstract - The exact age estimation is often treated as a
classification problem; while it can be formulated as a
regression problem. In this article, three different classifiers based
on KNN classifier's concept for facial age estimation were
designed and developed to achieve high efficiency calculation of
facial age estimation. In the first classifier, we adopt KNN-distance
approach to calculate minimum distance between test face
image and all instances belong to the class that has the highest
number of nearest samples. Additionally, in the second
classifier a modified-KNN version was proposed and the
classifier scoring results interpolated to calculate the exact age
estimation. Furthermore, KNN-regression classifier as third
classifier that used to combine the classification and regression
approaches to improve the accuracy of the age estimation
system. Moreover, we compared age estimation errors under
two situations: case 1, age estimation is performed without
discrimination between males and females (gender unknown);
and case 2, age estimation is performed for males and females
separately (gender known). Results of experiments conducted
on well know benchmark FG-NET Database show the
effectiveness of the proposed approach.
The document discusses improvements made to the VIVO search engine between versions 1.2.1 and 1.3. Version 1.3 transitioned from Lucene to SOLR, allowing it to index additional data from semantic relationships and interconnectivity. This enriched the search index and provided more relevant results with better ranking compared to version 1.2.1. Indexing was also improved through multithreading, reducing indexing time. Experiments were discussed to further enhance search quality through techniques like query expansion, spelling correction, and using ontologies for fact-based questioning.
The document summarizes experiences building a semantic web application to detect conflicts of interest using FOAF and DBLP data. It involved multiple steps: obtaining and preparing data; representing entities and relationships in an ontology; querying the data using semantic associations to determine COI levels; visualizing results; and evaluating based on a conference review dataset. The system was able to detect indirect COI relationships that syntactic matching would miss.
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Anastasia Zhukova
The document describes research on developing an automated system to identify framing and media bias in news articles through analysis of word choice and labeling (WCL). It presents an approach that uses natural language processing to identify semantic concepts that may be targets of bias and then compares how those concepts are framed across multiple news articles reporting on the same event. The approach involves preprocessing text, identifying semantic concepts, analyzing framing of concepts, and measuring framing similarity. It also details a multi-step merging methodology to align candidate concepts across articles and evaluates the approach on an annotated corpus, finding it outperforms baselines at identifying concepts with various levels of complexity in word choice.
Improving VIVO search through semantic ranking.Deepak K
The document discusses improvements made to the VIVO search engine between versions 1.2.1 and 1.3. Version 1.3 transitioned from Lucene to SOLR, enriching the index with additional data from semantic relationships. This led to more relevant search results with better ranking than version 1.2.1. Indexing was also improved through multithreading. Experiments were discussed using synonym analysis, spelling correction, and natural language techniques to further improve search results and support fact-based questioning.
Information in Wikipedia can be edited in over 300 languages independently. Therefore often the same subject in Wikipedia can be described differently depending on language edition. In order to compare information between them one usually needs to understand each of considered languages. We work on solutions that can help to automate this process. They leverage machine learning and artificial intelligence algorithms. The crucial component, however, is assessment of article quality therefore we need to know how to define and extract different quality measures. This presentation briefly introduces some of the recent activities of Department of Information Systems at Poznań University of Economics and Business related to quality assessment of multilingual content in Wikipedia. In particular, we demonstrate some of the approaches for the reliability assessment of sources in Wikipedia articles. Such solutions can help to enrich various language editions of Wikipedia and other knowledge bases with information of better quality.
Slides from the December 2020 Wikimedia Research Showcase: https://www.youtube.com/watch?v=v9Wcc-TeaEY
The document provides an overview of a tutorial on network analysis and the law given by Daniel Martin Katz and Michael J. Bommarito II. It discusses Katz's background in law and network science. The tutorial covers an introduction to network analysis including key concepts like nodes, edges, degree distributions and more. It also discusses applications of network analysis to law including legal elites, diffusion of legal ideas, and judicial citation networks. Advanced topics like community detection algorithms are also outlined.
This document provides an overview of social network analysis, including key concepts, analytic techniques, and examples of classic studies. It discusses the basic components of social networks like actors, ties, and relationships. It also describes different types of networks and measures used in social network analysis, such as degree centrality and betweenness centrality. Finally, it highlights some influential early social network analysis studies and resources for further information.
Optimizing Search Interactions within Professional Social Networks (thesis p...Nik Spirin
We must redesign all major elements of the search user interface, such as input, control, and informational, to provide more effective search interactions for users of professional social networks (PSNs). The existing interfaces deliver suboptimal utility as they underutilize structured nature of professional social networks entities.
Factoid based natural language question generation systemAnimesh Shaw
The proposed system addresses the generation of factoid or wh-questions from sentences in a corpus consisting of factual, descriptive and unbiased details. We discuss our heuristic algorithm for sentence simplification or pre-processing and the knowledge base extracted from previous step is stored in a structured format which assists us in further processing. We further discuss the sentence semantic relation which enables us to construct questions following certain recognizable patterns among different sentence entities, following by the evaluation of question generated.
This document outlines the course structure and content for a Data Science course. The 5 modules cover: 1) introductions to data science concepts and statistical inference using R; 2) exploratory data analysis and machine learning algorithms; 3) feature generation/selection and additional machine learning algorithms; 4) recommendation systems and dimensionality reduction; 5) mining social network graphs and data visualization. The course aims to teach students to define data science fundamentals, demonstrate the data science process, explain necessary machine learning algorithms, illustrate data analysis techniques, and follow ethics in data visualization.
Question Answering over Linked Data - Reasoning IssuesMichael Petychakis
Question answering system plays a vital role in search engine optimization model. Natural language processing methods are typically applied in QA system for inquiring user’s question and numerous steps are also followed for alteration of questions to query form for receiving a precise answer. This presentation analyzes diverse question answering systems that are based on semantic web technologies and ontologies with different formats of queries.It ends by addressing various reasoning alternatives.
This document summarizes a study that analyzed user sentiment and social influence in an online travel forum. The researchers constructed a user network based on forum interactions and used sentiment analysis to determine each user's sentiment score. They then applied a social influence model called a Linear Network Autocorrelation Model (LNAM) to test whether a user's sentiment is influenced by the sentiments of their peers in the network. The LNAM results showed that user sentiments are contagious, with a statistically significant influence parameter. Therefore, the researchers concluded that a user's happiness is influenced by the happiness of their peers in the network.
Knowledge Acquisition from Social Awareness StreamsClaudia Wagner
The document discusses analyzing social awareness streams (SAS) to extract knowledge and emerging ontological structures. It proposes developing a SAS analyzer system to study what knowledge can be acquired from SAS using controlled experiments. Preliminary experiments aim to observe if semantics emerge from different types of SAS aggregations. Initial results indicate hashtag streams are more robust than user list streams and network transformations reveal latent conceptual structures.
Social Network Analysis based on MOOC's (Massive Open Online Classes)ShankarPrasaadRajama
Collected data by conducting a survey about MOOC among fellow classmates and created edge lists of students and their skills and students and MOOC websites they do courses using Python from the survey data.
Performed visualization of student network in UCINET and found out the densities among clusters in the network.
Performed hypothesis testing to see whether characteristic of a student affects their position(centrality) in the network.
Mining and analyzing social media part 1 - hicss47 tutorial - dave kingDave King
This document provides an overview and agenda for part 1 of a presentation on mining and analyzing social media. It introduces key concepts like social homophily and its importance in political cyberbalkanization. It also discusses social network analysis and text mining as approaches for analyzing social media content and connections. Examples of social homophily are shown through gender homophily in adolescent social networks and increasing political polarization in the US Congress and Senate over time.
Mining and Supporting Community Structures in Sensor Network ResearchMarko Rodriguez
The document discusses mining and supporting community structures in sensor network research. It summarizes a study that analyzed the co-authorship network of researchers at the Center for Embedded Network Sensing (CENS) to determine if structural communities detected in the network are independent of socio-academic communities like academic department or affiliation. The study found that structural communities correspond more to department and affiliation, while academic position and country of origin are independent of structural communities.
Network theory examines how relationships between individuals (nodes) connected by interactions (edges) impact organizations. Key concepts include centrality, which measures how well-connected a node is; cliques, which are densely connected subgroups; and structural holes, which are brokers between separate groups. Applying network theory can provide insights into how relationships influence information sharing, collaboration, and social capital within organizations.
This document summarizes Martin Klein's dissertation defense on using the web infrastructure for real-time recovery of missing web pages. The defense agenda includes discussing lexical signatures for web pages, techniques for estimating document frequency values, correlations between term counts and document frequencies, using web page titles and link neighborhoods for rediscovery, and evaluating tags. The motivation is that a significant percentage of web pages become inaccessible over time, so the research aims to re-discover missing pages using content and link-based methods from the live web infrastructure.
The document discusses the development of the Semantic Web, which extends the current web to a web of data through the use of metadata, ontologies, and formal semantics. It describes key technologies like the Resource Description Framework (RDF) and Web Ontology Language (OWL) that add machine-readable meaning to web documents. The Semantic Web aims to enable machines to process and understand the semantics of information on the web.
Similar to Text Analysis of Social Networks: Working with FB and VK Data (20)
Graph's not dead: from unsupervised induction of linguistic structures from t...Alexander Panchenko
In this invited talk, presented at the Dialogue'2018 conference, I argue for the usefulness of graph representations for NLP in the deep learning era. In the lecture, it is described how to extract symbolic linguistic structures, such as word senses and semantic frames in an unsupervised way from text corpora using graph-based algorithms and distributional semantics.
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlAlexander Panchenko
We present DepCC, the largest-to-date linguistically analyzed corpus in English including 365 million documents, composed of 252 billion tokens and 7.5 billion of named entity occurrences in 14.3 billion sentences from a web-scale crawl of the Common Crawl project. The sentences are processed with a dependency parser and with a named entity tagger and contain provenance information, enabling various applications ranging from training syntax-based word embeddings to open information extraction and question answering. We built an index of all sentences and their linguistic meta-data enabling quick search across the corpus. We demonstrate the utility of this corpus on the verb similarity task by showing that a distributional model trained on our corpus yields better results than models trained on smaller corpora, like Wikipedia. This distributional model outperforms the state of art models of verb similarity trained on smaller corpora on the SimVerb3500 dataset.
http://www.lrec-conf.org/proceedings/lrec2018/summaries/215.html
Improving Hypernymy Extraction with Distributional Semantic ClassesAlexander Panchenko
http://www.lrec-conf.org/proceedings/lrec2018/pdf/234.pdf
In this paper, we show how distributionally-induced semantic classes can be helpful for extracting hypernyms. We present methods for inducing sense-aware semantic classes using distributional semantics and using these induced semantic classes for filtering noisy hypernymy relations. Denoising of hypernyms is performed by labeling each semantic class with its hypernyms. On the one hand, this allows us to filter out wrong extractions using the global structure of distributionally similar senses. On the other hand, we infer missing hypernyms via label propagation to cluster terms. We conduct a large-scale crowdsourcing study showing that processing of automatically extracted hypernyms using our approach improves the quality of the hypernymy extraction in terms of both precision and recall. Furthermore, we show the utility of our method in the domain taxonomy induction task, achieving the state-of-the-art results on a SemEval'16 task on taxonomy induction.
The paper was presented at the LREC'2018 conference in Miyazaki, Japan.
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesAlexander Panchenko
The document discusses techniques for inducing interpretable word senses from text for word sense disambiguation and enrichment of lexical resources. It describes related work on inducing word sense representations through methods like retrofitting word embeddings to learn sense embeddings, inducing synsets through graph clustering, and inducing semantic classes. The document also discusses making the induced senses interpretable and linking them to existing lexical resources.
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...Alexander Panchenko
The document describes a system for identifying answers and implicit dialogues in community question answering sites. It examines multiple feature types, including string similarity, word embeddings, topic modeling and keyword features. An SVM classifier is used to rank answers for three subtasks: comment classification, question-comment similarity and question-external comment similarity. Evaluation on the SemEval 2017 Task 3 dataset shows the full feature set performs best, with string and embedding features also providing significant contributions to performance.
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Alexander Panchenko
Presentation at the AIST'17 conference by Dmitry Ustalov. Authors of the original paper: Dmitry Ustalov, Mikhail Chernoskutov, Chris Biemann, Alexander Panchenko.
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...Alexander Panchenko
The document summarizes details about the 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2017) which was held from July 27-29, 2017 in Moscow, Russia at the Moscow Polytechnic University. Some key details include that there were 127 submissions to the conference this year, with the top 37 papers being selected for publication in the Springer Lecture Notes in Computer Science series. The conference consisted of tracks on various data analysis topics and had an international program committee of over 167 experts from 30 countries.
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationAlexander Panchenko
We introduce a new method for unsupervised knowledge-based word sense disambiguation (WSD) based on a resource that links two types of sense-aware lexical networks: one is induced from a corpus using distributional semantics, the other is manually constructed. The combination of two networks reduces the sparsity of sense representations used for WSD. We evaluate these enriched representations within two lexical sample sense disambiguation benchmarks. Our results indicate that (1) features extracted from the corpus-based resource help to significantly outperform a model based solely on the lexical resource; (2) our method achieves results comparable or better to four state-of-the-art unsupervised knowledge-based WSD systems including three hybrid systems that also rely on text corpora. In contrast to these hybrid methods, our approach does not require access to web search engines, texts mapped to a sense inventory, or machine translation systems.
See the full paper at: http://www.aclweb.org/anthology/W/W17/W17-1909.pdf
Panchenko A., Faralli S., Ponzetto S. P., and Biemann C. (2017): Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation. In Proceedings of the Workshop on Sense, Concept and Entity Representations and their Applications (SENSE) co-located with the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL'2017). Valencia, Spain. Association for Computational Linguistics
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Alexander Panchenko
The current trend in NLP is the use of highly opaque models, e.g. neural networks and word embeddings. While these models yield state-of-the-art results on a range of tasks, their drawback is
poor interpretability. On the example of word sense induction and disambiguation (WSID), we show that it is possible to develop an interpretable model that matches the state-of-the-art models in accuracy. Namely, we present an unsupervised, knowledge-free WSID approach, which is interpretable at three levels: word sense inventory, sense feature representations, and disambiguation procedure. Experiments show that our model performs on par with state-of-the-art word sense embeddings and other unsupervised systems while offering the possibility to justify
its decisions in human-readable form.
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Alexander Panchenko
We introduce an approach to word sense
induction and disambiguation. The method
is unsupervised and knowledge-free: sense
representations are learned from distributional
evidence and subsequently used to
disambiguate word instances in context.
These sense representations are obtained
by clustering dependency-based secondorder
similarity networks. We then add
features for disambiguation from heterogeneous
sources such as window-based and
sentence-wide co-occurrences, and explore
various schemes to combine these context
clues. Our method reaches a performance
comparable to the state-of-the-art unsupervised
word sense disambiguation systems
including top participants of the SemEval
2013 word sense induction task and two
more recent state-of-the-art neural word
sense induction systems
Full paper:
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/konvens2016panchenko.pdf
Getting started in Apache Spark and Flink (with Scala) - Part IIAlexander Panchenko
This document provides an outline and overview of a Spark and Flink tutorial presented by Alexander Panchenko, Gerold Hintz, and Steffen Remus on 11.07.2016. The tutorial covers Scala basics, Spark fundamentals including RDDs and transformations/actions, running Spark applications, and key differences between Spark and Flink. It aims to help users get started with Apache Spark and Flink for distributed data processing and machine learning.
Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, Chris Biemann
IIT Patna, India
TU Darmstadt, Germany
Presented by: Alexander Panchenko, TU Darmstadt, Germany
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Current Ms word generated power point presentation covers major details about the micronuclei test. It's significance and assays to conduct it. It is used to detect the micronuclei formation inside the cells of nearly every multicellular organism. It's formation takes place during chromosomal sepration at metaphase.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
The cost of acquiring information by natural selectionCarl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
Text Analysis of Social Networks: Working with FB and VK Data
1. Data SNA User Gender User Language User Interests User Matching Other tasks References
Text Analysis of Social Networks:
Working with FB and VK Data
AI Ukraine Conference, Kharkiv, Ukraine
Alexander Panchenko
alexander.panchenko@uclouvain.be
Digital Society Laboratory LLC & UCLouvain
October 25, 2014
Alexander Panchenko Text Analysis of Social Networks
2. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
3. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
4. Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the users’s standpoint
Facebook (FB) and VKontakte (VK)
Alexander Panchenko Text Analysis of Social Networks
5. Data SNA User Gender User Language User Interests User Matching Other tasks References
Social networks from the data miner’s standpoint
Facebook (FB) and VKontakte (VK)
Profiles: a set of user attributes
categorical variables (region, city, profession, etc.)
integer variables (age, graduation year, etc.)
text variables (name, surname, etc.)
Network: a graph that relates users
friendship graph
followers graph
commenting graph, etc.
Texts:
posts
comments
group titles and descriptions
Alexander Panchenko Text Analysis of Social Networks
6. Data SNA User Gender User Language User Interests User Matching Other tasks References
Gathering of VK and FB data
Big Data: VK worth tens or even hundreds of TB
Decide what do you need (posts, profiles, etc.).
Download:
API
Scraping
Download limits and API limitations are specific for each
network.
Parallelization is very practical, especially horizontal one:
Amazon EC2, Distributed Message Queues
Alexander Panchenko Text Analysis of Social Networks
7. Data SNA User Gender User Language User Interests User Matching Other tasks References
Storing VK and FB data
Again, Big Data
NoSQL solutions are helpful
Raw data: Amazon S3
For analysis: HDFS
Efficient retrieval: Elastic Search
Alexander Panchenko Text Analysis of Social Networks
8. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
9. Data SNA User Gender User Language User Interests User Matching Other tasks References
Social Network Analysis
Structure analysis: friendship graph, comments graph, etc.
Content analysis: profile attributes, posts, comments, etc.
Combined approaches.
What scientific communities analyze social networks?
60s – the first structural methods
00s – online social network analysis boom
Social Network Analysis community (Sociologists,
Statisticians, Physicists)
Data and Graph Mining community
Natural Language Processing community
Alexander Panchenko Text Analysis of Social Networks
10. Data SNA User Gender User Language User Interests User Matching Other tasks References
Technologies for analysis of social networks
Machine Learning: hidden vs observable user attributes
Training of the model often can be scaled vertically
Applying the model should be scaled horizontally
Alexander Panchenko Text Analysis of Social Networks
11. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
12. Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Andrey Teterin.
Detect gender of a user
to profile a user;
user segmentation is helpful in search, advertisement, etc.
By text written by a user:
Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009],
Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al.
[2010], Rangel and Rosso Rangel and Rosso [2013], Al Zamal
et al.Al Zamal et al. [2012] and Lui et al. Liu et al. [2012].
By full name: Burger et al. [2011], Panchenko and Teterin
[2014]
Alexander Panchenko Text Analysis of Social Networks
13. Data SNA User Gender User Language User Interests User Matching Other tasks References
Online demo
http://research.digsolab.com/gender
Alexander Panchenko Text Analysis of Social Networks
14. Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
100,000 full names of Facebook users with known gender
full name – first and last name of a user
gender: male or female
names written in both Cyrillic and Latin alphabets
“Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.
Alexander Panchenko Text Analysis of Social Networks
15. Data SNA User Gender User Language User Interests User Matching Other tasks References
Training Data
Figure : Name-surname Alexander co-Panchenko occurrences: Text rows Analysis and of columns Social Networks
are sorted by
16. Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
72% of first names have typical male/female ending
68% of surnames have typical male/female ending
a typical male/female ending splits males from females with an
error less than 5%
gender of 50% first names recognized with 8 endings
gender of 50% second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classify
about 30% of names. This motivates the need for a more
sophisticated statistical approach.
Alexander Panchenko Text Analysis of Social Networks
17. Data SNA User Gender User Language User Interests User Matching Other tasks References
Character endings of Russian names
Type Ending Gender Error, % Example
first name na (íà) female 0.27 Ekaterina
first name iya (èÿ) female 0.32 Anastasiya
first name ei (åé) male 0.16 Sergei
first name dr (äð) male 0.00 Alexandr
first name ga (ãà) male 4.94 Serega
first name an (àí) male 4.99 Ivan
first name la (ëà) female 4.23 Luidmila
first name ii (èé) male 0.34 Yurii
second name va (âà) female 0.28 Morozova
second name ov (îâ) male 0.21 Objedkov
second name na (íà) female 2.22 Matyushina
second name ev (åâ) male 0.44 Sergeev
second name in (èí) male 1.94 Teterin
Table : Most discriminative and frequent two character endings of
Russian names.
Alexander Panchenko Text Analysis of Social Networks
18. Data SNA User Gender User Language User Interests User Matching Other tasks References
Gender Detection Method
input: a string representing a name of a person
output: gender (male or female)
binary classification task
Features
endings
character n-grams
dictionary of male/female names and surnames
Model
L2-regularized Logistic Regression
Alexander Panchenko Text Analysis of Social Networks
19. Data SNA User Gender User Language User Interests User Matching Other tasks References
Features
Character n-grams
males: Alexander Yaroskavski, Oleg Arbuzov
females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
BUT: “Sidorenko”, “Moroz” or “Bondar”!
two most common one-character endings: “a” and “ya” (“ÿ”)
Dictionaries of first and last names
probability that it belongs to the male gender:
P(c = malejfirstname), P(c = malejlastname).
3,427 first names, 11,411 last names
Alexander Panchenko Text Analysis of Social Networks
20. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Model Accuracy Precision Recall F-measure
rule-based baseline 0,638 0,995 0,633 0,774
endings 0,850 0,002 0,921 0,003 0,784 0,004 0,847 0,002
3-grams 0,944 0,003 0,948 0,003 0,946 0,003 0,947 0,003
dicts 0,956 0,002 0,992 0,001 0,925 0,003 0,957 0,002
endings+3-grams 0,946 0,003 0,950 0,002 0,947 0,004 0,949 0,003
3-grams+dicts 0,956 0,003 0,960 0,003 0,957 0,004 0,959 0,003
endings+3-grams+dicts 0,957 0,003 0,961 0,003 0,959 0,004 0,960 0,002
Table : Results of the experiments on the training set of 10,000 names.
Here endings – 4 Russian female endings, trigrams – 1000 most frequent
3-grams, dictionary – name/surname dict. This table presents precision,
recall and F-measure of the female class.
Alexander Panchenko Text Analysis of Social Networks
21. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure : Learning curves of single and combined models. Accuracy was
estimated on separate sample of 10,000 names.
Alexander Panchenko Text Analysis of Social Networks
22. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
23. Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Motivation
Goal: to detect Russian-speaking users
Cyrillic alphabet is used also by Ukrainian, Belorussian,
Bulgarian, Serbian, Macedonian, Kazakh, etc
Research Questions
Which method is the best for Russian language?
How to adopt it to the FB profile?
Contributions
Comparison of Russian-enabled language detection modules.
A technique for identification of Russian-speaking users.
Alexander Panchenko Text Analysis of Social Networks
24. Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
input: a FB user profile
output: is Russian-speaker? (or a set of languages user speaks)
Common Russian character trigrams
íà , ïð, òî, íå, ëè, ïî, íî , â , íà,
òü, íå, è , êî, îì, ïðî, òî , èõ, êà, àòü,
îòî, çà, èå, îâà, òåë, òîð, äå, îé , ñòè,
îò, àõ , ìè, ñòð, áå, âî, ðà, àÿ , âàò, åé ,
åò , æå, è÷å, èÿ , îâ , ñòî, îá, âåð, ãî , è
â, è ï, è ñ, èè , èñò, î â, îñò, òðà, òå, åëè,
åðå, êîò, ëüí, íèê, íòè, î ñ
Alexander Panchenko Text Analysis of Social Networks
25. Data SNA User Gender User Language User Interests User Matching Other tasks References
Existing modules for language identification
langid.py
https://github.com/saffsd/langid.py
Advanced n-gram selection
chromium compact language detector (cld)
https://code.google.com/p/chromium-compact-language-detector
guess-language
https://code.google.com/p/guess-language
Google Translate API
https://developers.google.com/translate/v2/using_
rest#detect-language
20$/1M characters
Yandex Translate API
http://api.yandex.ru/translate
Free of charge, 1M of characters / day (by September 2013)
Many more, e.g. language-detection for Java
Alexander Panchenko Text Analysis of Social Networks
26. Data SNA User Gender User Language User Interests User Matching Other tasks References
DBpedia Dataset
Alexander Panchenko Text Analysis of Social Networks
27. Data SNA User Gender User Language User Interests User Matching Other tasks References
Accuracy of Different Language Detection Modules
Alexander Panchenko Text Analysis of Social Networks
28. Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset: Method
Profile text: posts + comments + user names – Latin
symbols.
Profile text length: 3,367 +- 17,540
Russian-speakers: P(ru) 0.95
Core Russian-speakers:
P(ru) 0.95
# Cyrillic symbols = 20%
locale is ru_RU
Alexander Panchenko Text Analysis of Social Networks
29. Data SNA User Gender User Language User Interests User Matching Other tasks References
Facebook Dataset: Results
9,906,524 public FB profiles (= 50 cyr. characters)
8,687,915 (88%) Russian-speaking users
3,190,813 (32%) core Russian-speaking public Facebook users
5,365,691 (54%) of profiles with no profile text (= 200
characters)
Alexander Panchenko Text Analysis of Social Networks
30. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
31. Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Sergei Objedkov.
input: some SN data representing a user
output: list of user interests
Motivation
Advertisement: targeting, user segmentation, etc.
Recommendations of content and friends
Customization of user experience
. . .
Alexander Panchenko Text Analysis of Social Networks
32. Data SNA User Gender User Language User Interests User Matching Other tasks References
Data: FB and VK groups
Text corpus
41 million of VK groups
11 million of FB publics
1.5 million of FB groups
Data format
Title and/or description
List of members
Number of comments, likes, posts by a member
Alexander Panchenko Text Analysis of Social Networks
33. Data SNA User Gender User Language User Interests User Matching Other tasks References
Data: VK groups
Alexander Panchenko Text Analysis of Social Networks
34. Data SNA User Gender User Language User Interests User Matching Other tasks References
253 interests detected by our system
Alexander Panchenko Text Analysis of Social Networks
35. Data SNA User Gender User Language User Interests User Matching Other tasks References
Method
1 Create a text index of groups
2 Create a keyword list for each of 253 interests
3 KW classifier:
Retrieve top k groups retrieved by a set interest keywords
Rank by TF-IDF
Associate group’s interests with its users
A group may have multiple interests
4 ML classifier:
Use top k groups as a training data
BOW features
Keyword features
Linear models: L2 LR, Liner SVM, NB
Classify all groups
A group may have up to three top interests
Associate group’s interests with its users
Alexander Panchenko Text Analysis of Social Networks
36. Data SNA User Gender User Language User Interests User Matching Other tasks References
Association of group’s interests with its users
Engagement of a person into an interest category is
proportional to the activity of the person in groups of this category:
e wlike l + ws:comm cs + wl:comm cl + wrepost r
l – the number of post likes
cs – the number of short comments
cl – the number of long comments
r – the number of reposts
Association score of a user and an interest depends on
engagement in a group and on the number of groups:
all efb gfb +
37. evk gvk :
evk , efb – engagement into VK/FB interest
gvk , gfb – number of groups a user has in FB/VK
Alexander Panchenko Text Analysis of Social Networks
38. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Alexander Panchenko Text Analysis of Social Networks
39. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results per category: the best and the worst
Alexander Panchenko Text Analysis of Social Networks
40. Data SNA User Gender User Language User Interests User Matching Other tasks References
Top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
41. Data SNA User Gender User Language User Interests User Matching Other tasks References
Intersection of the top 30 interests on FB and VK
Alexander Panchenko Text Analysis of Social Networks
42. Data SNA User Gender User Language User Interests User Matching Other tasks References
Interests co-occurrences
Alexander Panchenko Text Analysis of Social Networks
43. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
44. Data SNA User Gender User Language User Interests User Matching Other tasks References
Problem
Joint work with Dmitry Babaev and Segei Objedkov.
Motivation
input: a user profile of one social network
output: profile of the same person in another social network
immediate applications in marketing, search, security, etc.
Contribution
user identity resolution approach
precision of 0.98 and recall of 0.54
the method is computationally effective and easily parallelizable
Alexander Panchenko Text Analysis of Social Networks
45. Data SNA User Gender User Language User Interests User Matching Other tasks References
Dataset
VK Facebook
Number of users in our dataset 89,561,085 2,903,144
Number of users in Russia 1 100,000,000 13,000,000
User overlap 29% 88%
training set: 92,488 matched FB-VK profiles
1According to comScore and http://vk.com/about
Alexander Panchenko Text Analysis of Social Networks
46. Data SNA User Gender User Language User Interests User Matching Other tasks References
Profile matching algorithm
1 Candidate generation. For each VK profile we retrieve a set
of FB profiles with similar first and second names.
2 Candidate ranking. The candidates are ranked according to
similarity of their friends.
3 Selection of the best candidate. The goal of the final step
is to select the best match from the list of candidates.
Alexander Panchenko Text Analysis of Social Networks
47. Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate generation
Retrieve FB users with names similar to the input VK profile.
Two names are similar if the first letters are the same and the
edit distance between names 2.
Levenshtein Automata for fuzzy match between a VK user
name and all FB user names
Automatically extracted dictionary of name synonyms:
“Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
Alexander Panchenko Text Analysis of Social Networks
48. Data SNA User Gender User Language User Interests User Matching Other tasks References
Candidate ranking
The higher the number of friends with similar names in VK
and FB profiles, the greater the similarity of these profiles.
Two friends are considered to be similar if:
First two letters of their last names match
Similarity between first/last names sims are greater than
thresholds ;
49. :
sims (si ; sj ) = 1
lev (si ; sj )
max(jsi j; jsj j)
;
Contribution of each friend to similarity simp of two profiles
pvk and pfb is inverse of name expectation frequency:
X
simp(pvk ; pfb) =
j:sims (sf
i ;sf
j )^sims (ss
i ;ss
j )
50. min(1;
N
jsf
j j jss
j j
):
Here sf
i and ss
i are first and second names of a VK profile,
correspondingly, while sf
j and ss
j refer to a FB profile.
Alexander Panchenko Text Analysis of Social Networks
51. Data SNA User Gender User Language User Interests User Matching Other tasks References
Best candidate selection
FB candidates are ranked according to similarity simp to an
input profile pvk
The best candidate pfb should pass two thresholds to match:
its score should be higher than the score threshold
:
simp(pvk ; pfb)
:
either the only candidate or score ratio between it and the next
best candidate p0
fb should be higher than the ratio threshold :
simp(pvk ; pfb)
simp(pvk ; p0
fb)
:
Alexander Panchenko Text Analysis of Social Networks
52. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results
Figure : Precision-recall plot of the matching method. The bold line
denotes the best precision at given recall.
Alexander Panchenko Text Analysis of Social Networks
53. Data SNA User Gender User Language User Interests User Matching Other tasks References
Results: matching VK and FB profiles
First name threshold, 0.8
Second name threshold,
54. 0.6
Profile score threshold,
3
Profile ratio threshold, 5
Number of matched profiles 644,334 (22%)
Expected precision 0.98
Expected recall 0.54
Alexander Panchenko Text Analysis of Social Networks
55. Data SNA User Gender User Language User Interests User Matching Other tasks References
Outline
1 Social Network Data
2 Social Network Analysis
3 User Gender Detection
4 User Language Detection
5 User Interests Detection
6 VK-FB User Matching
7 Other SNA Tasks
Alexander Panchenko Text Analysis of Social Networks
56. Data SNA User Gender User Language User Interests User Matching Other tasks References
Much more fun stuff can be done with the FB/VK data
User Age Region Detection
Tell me who are your friends, and I will say who you are.
Most frequent age/region of friends.
Reject users with high variation of age/region among friends.
Up to 85-90% of precision.
User Income Detection
Transfer learning: target variable is not present in SNs.
Training a model on a set of users with known income.
Applying the model on the social network profiles.
Alexander Panchenko Text Analysis of Social Networks
57. Data SNA User Gender User Language User Interests User Matching Other tasks References
Thank you! Questions?
Alexander Panchenko Text Analysis of Social Networks
58. Data SNA User Gender User Language User Interests User Matching Other tasks References
Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent
attribute inference: Inferring latent attributes of twitter users
from neighbors. In ICWSM.
Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011).
Discriminating gender on twitter. In Proceedings of the
Conference on Empirical Methods in Natural Language
Processing, pages 1301–1309. Association for Computational
Linguistics.
Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference
of twitter users in non-english contexts. In Proceedings of the
2013 Conference on Empirical Methods in Natural Language
Processing, Seattle, Wash, pages 18–21.
Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric
analysis of bloggers’ age and gender. In Third International AAAI
Conference on Weblogs and Social Media.
Koppel, M., Argamon, S., and Shimoni, A. R. (2002).
Alexander Panchenko Text Analysis of Social Networks
59. Data SNA User Gender User Language User Interests User Matching Other tasks References
Automatically categorizing written texts by author gender.
Literary and Linguistic Computing, 17(4):401–412.
Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to
infer gender composition of commuter populations. Proceedings
of the When the City Meets the Citizen Worksop.
Mukherjee, A. and Liu, B. (2010). Improving gender classification
of blog authors. In Proceedings of the 2010 Conference on
Empirical Methods in Natural Language Processing, pages
207–217. Association for Computational Linguistics.
Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011).
Predicting age and gender in online social networks. In
Proceedings of the 3rd international workshop on Search and
mining user-generated contents, pages 37–44. ACM.
Rangel, F. and Rosso, P. (2013). Use of language and author
profiling: Identification of gender and age. Natural Language
Processing and Cognitive Science, page 177.
Alexander Panchenko Text Analysis of Social Networks
60. Data SNA User Gender User Language User Interests User Matching Other tasks References
Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010).
Classifying latent user attributes in twitter. In Proceedings of the
2nd international workshop on Search and mining user-generated
contents, pages 37–44. ACM.
Alexander Panchenko Text Analysis of Social Networks