This document describes a thesis submitted in partial fulfilment of the requirements for a Bachelor of Technology degree in Computer Science and Engineering. The thesis aims to improve the efficiency of identifying members of a community using seed set expansion. It surveys existing seed expansion algorithms, identifies opportunities to improve their performance, and develops a modification of the PageRank algorithm that outperforms existing approaches. The methods are evaluated on multiple publicly available datasets containing ground-truth communities.
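The general idea behind PageRank-based seed set expansion can be sketched as a personalized PageRank (PPR) walk that restarts at the seed nodes, followed by taking the top-ranked nodes as the expanded community. The sketch below is illustrative only: all function names and parameter choices are assumptions, and it shows the standard PPR expansion idea rather than the thesis's specific modification.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """Power iteration for personalized PageRank.

    graph: dict mapping node -> list of neighbours (undirected adjacency).
    seeds: collection of seed nodes; the walk restarts uniformly over them.
    alpha: probability of following an edge (1 - alpha is the restart mass).
    """
    seeds = set(seeds)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in graph}
    rank = dict(restart)
    for _ in range(iters):
        # Each node keeps its restart mass and spreads the rest to neighbours.
        nxt = {v: (1 - alpha) * restart[v] for v in graph}
        for v, nbrs in graph.items():
            if not nbrs:
                continue
            share = alpha * rank[v] / len(nbrs)
            for u in nbrs:
                nxt[u] += share
        rank = nxt
    return rank

def expand_seed_set(graph, seeds, size):
    """Return the `size` nodes with the highest PPR score as the community."""
    rank = personalized_pagerank(graph, seeds)
    return sorted(rank, key=rank.get, reverse=True)[:size]
```

On a graph made of two triangles joined by a single bridge edge, expanding from one seed in the first triangle recovers that triangle, since the restart keeps probability mass concentrated near the seeds. Production systems typically replace the dense power iteration with a local push algorithm and a conductance-based sweep cut, so that only the neighbourhood of the seeds is ever touched.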
IMPROVING EFFICIENCY OF COMMUNITY
MEMBER IDENTIFICATION USING SEED SET
EXPANSION
A thesis submitted in partial fulfilment of the requirements for
the award of the degree of
B. Tech
In
Computer Science and Engineering
By
Abishek Prasanna (106111002)
R Sibi (106111068)
Rahul R (106111070)
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPALLI-620015
MAY 2015
BONAFIDE CERTIFICATE
This is to certify that the project titled IMPROVING EFFICIENCY OF
COMMUNITY MEMBER IDENTIFICATION USING SEED SET
EXPANSION is a bonafide record of the work done by
Abishek Prasanna (106111002)
R Sibi (106111068)
Rahul R (106111070)
in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering of the NATIONAL
INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2014-2015.
Dr. E. Sivasankar Dr. (Mrs) R. Leela Velusamy
Guide Head of the Department
Project Viva-voce held on _____________________________
Internal Examiner External Examiner
ABSTRACT
In many applications, a network of people is involved, and one would like to identify the
members of an interesting but unlabelled group or community. One starts with a small
number of exemplar group members – they may be followers of a political ideology
or fans of a music genre – and uses those examples to discover the additional members.
This problem gives rise to the seed expansion problem in community detection: given
example community members, how can the social graph be used to predict the
identities of remaining, hidden community members? In contrast with global
community detection (graph partitioning or covering), seed expansion is best suited
for identifying communities locally concentrated around nodes of interest. A growing
body of work has used seed expansion as a scalable means of detecting overlapping
communities. Yet despite growing interest in seed expansion, there are divergent
approaches in the literature and there still isn’t a systematic understanding of which
approaches work best in different domains.
Here, several variants are evaluated and subtle trade-offs between the different
approaches are uncovered. The ideas in these algorithms that leave room for
performance gains are explored, focusing on heuristics that one can control in
practice. As a consequence of this systematic understanding, several opportunities for
performance gains were discovered. We have thereby developed our own
modification of the PageRank algorithm and shown that it performs better than
the existing approaches. This leads to interesting connections and contrasts
with active learning and the trade-offs of exploration and exploitation. Finally, we
explore the expansion problem by bringing in an adaptive algorithm that is found to
work well with the improved version that we have developed. We evaluate our
methods across multiple domains, using publicly available datasets with labelled,
ground-truth communities.
Keywords: Seed set expansion, Ground-truth communities
ACKNOWLEDGEMENTS
We would like to thank our project guide Dr. E. Sivasankar, Assistant Professor,
Department of Computer Science and Engineering, for his constant guidance,
encouragement and help during the entire duration of the project. His enthusiasm has
been a driving force for our efforts through the course of this project.
We would also like to offer our sincere thanks to Dr. (Mrs).R.Leela Velusamy, Head
of the Department, Computer Science and Engineering, National Institute of
Technology, Trichy who provided us with the necessary environment, tools and
feedback for the implementation of the project.
TABLE OF CONTENTS
Title Page Number
ABSTRACT……………………………………………………………… i
ACKNOWLEDGEMENTS …………………………………………….. ii
TABLE OF CONTENTS…………………………………………………. iii
LIST OF FIGURES ……………………………………………………… v
NOTATIONS …………………………………………………………….. vi
CHAPTER 1: INTRODUCTION
1.1 Motivation …………………………………………………….. 1
1.2 Community ……………………………………………………. 1
1.3 Community detection …………………………………………. 2
1.4 Practical Applications of Clustering, Community Detection… 3
1.5 Purpose Of Community Detection ……………………………... 4
1.5.1 Is it necessary to extract groups based on network topology? 5
1.5.2 Importance of network interaction ………………………. 5
1.6 Challenges ……………………………………………………… 5
1.7 Thesis Overview………………………………………………… 7
CHAPTER 2: LITERATURE SURVEY
2.1 Neighbour Counting Algorithm …………………………………. 8
2.1.1 Community Discovery Methods …………………………. 8
2.1.1.1 Graph Partition Techniques ………………………… 8
2.1.1.2 Hierarchical Clustering …………………………….. 9
2.1.2 Expanding an existing community………………………… 10
2.2 Greedy algorithm ……………………………………………….. 14
2.3 Drawbacks of Greedy algorithm ……………………………….. 16
2.4 Drawbacks of PageRank algorithm …………………………….. 16
CHAPTER 3: IMPLEMENTATION
3.1 Finding potential members ………………………………………… 17
3.2 Improved PageRank ……………………………………………….. 19
CHAPTER 4: PERFORMANCE ANALYSIS
4.1 Neighbour Counting ………………………………………………. 25
4.2 Greedy Algorithm …………………………………………………. 26
4.3 PageRank algorithm ………………………………………………. 27
4.4 Comparing performances of PageRank and the proposed Improved
PageRank ………………………………………………………….. 28
4.5 Inference …………………………………………………………... 31
CONCLUSION AND FUTURE WORK …………………………………..... 32
REFERENCES ……………………………………………………………….. 33
BIBLIOGRAPHY …………………………………………………………… 34
APPENDICES
Appendix A Snippets of Codes ……………………………………... 36
Appendix B Glossary of Terms …………………………………….. 38
LIST OF FIGURES
Figure Page No
1.1 Community Structure …..........................................................................… 2
1.2 Categorization of various search engines …............................................... 4
1.3 Typical life cycle of a social media network …………….......................... 4
2.1 Neighbour counting representation ….......................................................... 14
3.1 Finding potential neighbour ….................................................................... 17
3.2 Flowchart depicting neighbour counting algorithm ….................................. 19
3.3 Flowchart depicting improved PageRank algorithm …............................... 21
3.4 Input WebGraph …........................................................................................ 22
3.5 Existing community members …................................................................. 22
3.6 Calculating PageRank using the existing code …......................................... 23
3.7 Calculating PageRank using improved algorithm ….................................... 23
3.8 Potential members are determined …............................................................ 24
3.9 Graph showing analysis between PageRank and Improved PageRank …….. 24
4.1 Neighbour counting algorithm steps ………………………………………… 25
4.2 Greedy algorithm steps ……………………………………………………… 26
4.3 Proposed Improved PageRank Algorithm steps ……………………………. 27
4.4 Graphs showing iteration of PageRank ……………………………………… 28
4.5 Modularity Comparison between the three algorithms ……………………… 30
4.6 Outwardness Comparison between the three algorithms ……………………. 31
NOTATIONS
Q Modularity
eij Edge directed from node ‘i’ to node ‘j’
Pr[n,k] Probability that entity would happen to have at least ‘k’ of n
neighbours from group by chance
P Fraction of known group members to network nodes
R Local modularity
Bij Adjacency matrix comprising only those edges with one or more endpoints in
community
Min Number of edges internal to community
Mout Number of edges external to community
Ov(C) Outwardness of vertex ‘v’ in community ‘C’
Kv Degree of vertex ‘v’
Kv^out Number of neighbours outside community
Kv^in Number of neighbours inside community
PR[A] PageRank of node ‘A’
C[A] Total number of outgoing links on ‘A’
N Total number of nodes in the graph
d Damping factor
CHAPTER 1
INTRODUCTION
1.1 Motivation
Networks are omnipresent on the Web. The most prominent Web network is the Web
itself, comprising billions of pages as vertices and their hyperlinks to each other as
edges. Moreover, collecting and processing the input of Web users (e.g. queries,
clicks) results in other forms of networks, such as the query graph. Finally, the
widespread use of Social Media applications, such as Bibsonomy, IMDB, Flickr and
YouTube, is responsible for the creation of even more networks, ranging from
folksonomy networks to rich media social networks. Not only is it possible by
analyzing such networks to gain insights into the social phenomena and processes that
take place in the world, but one can also extract actionable knowledge that can be
beneficial in several information management and retrieval tasks, such as online
content navigation and recommendation. However, the analysis of such networks
poses serious challenges to data mining methods, since these networks are almost
invariably characterized by huge scales and a highly dynamic nature.
A valuable tool in the analysis of large complex networks is community detection.
The problem that community detection attempts to solve is the identification of
groups of vertices that are more densely connected to each other than to the rest of the
network. Detecting and analyzing the community structure of networks has led to
important findings in a wide range of domains, ranging from biology to social
sciences and the Web. Such studies have shown that communities constitute
meaningful units of organization and that they provide new insights in the structure
and function of the whole network under study. Recently, there has been increasing
interest in applying community detection on Social Media networks not only as a
means of understanding the underlying phenomena taking place in such systems, but
also to exploit its results in a wide range of intelligent services and applications, e.g.
recommendation engines, automatic event detection in Social Media content.
1.2 Community
A community is formed by individuals such that those within a group interact with
each other more frequently than with those outside the group. A network community
(also sometimes referred to as a module or cluster) is typically thought of as a group
of nodes with more and/or better interactions amongst its members than between its
members and the remainder of the network. Figure 1.1 shows a sample community
structure with three interlinked communities.
Figure 1.1: Community structure
1.3 Community Detection
Several attempts have been made to provide a formal definition for this generally
described community detection concept in networks. A strong community was defined
as a group of nodes for which each node of the community has more edges to other
nodes of the same community than to nodes outside the community. This is a
relatively strict definition, in the sense that it does not allow for overlapping
communities and creates a hierarchical community structure since the entire graph can
be a community itself. A weak community was later defined as a subgraph in which
the sum of all node degrees within the community is larger than the sum of all node
degrees toward the rest of the graph [6].
It is the process of discovering groups in a network where individuals' group
memberships are not explicitly given. The problem of cluster or community detection
in real world graphs that involves large social networks, web graphs and biological
networks is a problem of considerable practical interest and has received a lot of
attention recently. To extract such sets of nodes one typically chooses an objective
function that captures the above intuition of a community as a set of nodes with better
internal connectivity than external connectivity. Then, since the objective is typically
NP-hard to optimize exactly, one employs heuristics or approximation algorithms to
find sets of nodes that approximately optimize the objective function and that can be
understood or interpreted as real communities. Alternatively, one might define
communities operationally to be the output of a community detection procedure,
hoping they bear some relationship to the intuition as to what it means for a set of
nodes to be a good community. Once extracted, such clusters of nodes are often
interpreted as organizational units in social networks, functional units in biochemical
networks, ecological niches in food web networks, or scientific disciplines in citation
and collaboration networks.
1.4 Practical applications of clustering, community detection
Recommendation tools for forming on-line groups have the potential to collect a
few initial suggestions from a user and then produce a longer list of recommended
group members.
Similarly, a marketer may want to expand a set of a few interested consumers of a
product into a longer list of people who might also be interested in the product.
Seed set expansion has also been used to infer missing attributes in user profile
data [3] and to detect e-mail addresses of spammers.
Simplifies visualization, analysis on complex graphs
Search engines – Categorization. Figure 1.2 depicts a dendrogram of various
search engines. A similarity threshold is employed to isolate clusters of similar
quality.
Figure 1.2: Categorization of various search engines
Social networks - Useful for tracking group dynamics. The typical life cycle of a
social media network is depicted in Figure 1.3. A raw social media network is
formulated by clustering recorded transactions and is further simplified to form a
clean network.
Figure 1.3: Typical life cycle of a social media network.
Neural networks - Tracks functional units
One major challenge in neuroscience is to identify the functional modules from
multichannel, multiple subjects’ recordings. Most research on community detection
has focused on finding the association matrix based on functional connectivity,
instead of effective connectivity, thus not capturing the causality in the network.
Food webs - helps isolate co-dependent groups of organisms
1.5 Purpose of community detection
Understanding the interactions between people.
Visualising and navigating huge networks.
Forming the basis for other tasks such as data mining.
Social networks often include community groups based on common location,
interests, occupation, etc. Communities are present in metabolic networks
based on functional groupings. Communities are formed in citation networks
based on research topic. Identifying these sub-structures within a network
can provide knowledge about how network function and topology affect each
other.
1.5.1 Is it necessary to extract groups based on network topology?
Not all social media websites provide a community platform.
Not all people want to make the effort to join groups.
Through community extraction communities can be suggested to people based
on their interests.
Groups in the real world change dynamically.
Besides social media websites it is essential to extract communities in other
networks such as citation networks, World Wide Web, metabolism networks
for various practical purposes.
1.5.2 Importance of network interaction
Rich information about the relationship between users can be obtained through
analysing network interaction which can complement other kinds of
information, e.g. user profile.
It provides basic information that is essential for other tasks, e.g.
recommendation.
Analysing network interaction helps in network visualization and navigation.
1.6 Challenges
The major challenges usually encountered in the problem of community detection in
networks are highlighted below:
Scalability
The amount of online media content on the internet is rising every day at a
tremendous rate. Currently, the sizes of such networks are on the scale of billions of
nodes and connections. As a network expands, both the space required to store it and
the time needed to process it grow rapidly. This imposes a great challenge on
conventional community detection algorithms, which typically handle networks of
only thousands of nodes.
Heterogeneity
Raw media networks comprise multiple types of edges and vertices and are usually
represented as hypergraphs or k-partite graphs. The majority of community detection
algorithms are not applicable to such graphs. For that reason, it is common practice to
extract simplified network forms that depict partial aspects of the complex
interactions of the original network.
Evolution
Due to the highly dynamic nature of social media data, the evolving nature of the
network should be taken into account in network analysis applications. So far, the
discussion on community detection has progressed under the tacit assumption that the
network under consideration is static. Time awareness should be incorporated into
community detection approaches.
Evaluation
The lack of reliable ground-truth makes the evaluation extremely difficult [7].
Currently the performance of community detection methods is evaluated by manual
inspection. Such anecdotal evaluation procedures require extensive manual effort, are
non-comprehensive and limited to small networks.
Privacy
Privacy is a big concern in social media; Facebook and Google often appear in
debates about privacy. Simple anonymity does not necessarily protect privacy. As
private information is involved, a secure and trustworthy system is critical. Hence, a
lot of valuable information is not made available due to security concerns.
1.7 Thesis Overview
The remainder of the thesis is organized as follows. The next chapter provides a
background of the algorithms studied, discussing their pros and cons. The subsequent
parts deal with the development of an improved version of the existing PageRank
algorithm and how it can be used to solve the problem of community expansion.
The third chapter explains the implementation of the proposed improved PageRank
algorithm. The final chapter deals with the comparison between the three algorithms
and a brief performance analysis of the improved algorithm. The thesis ends with a
short conclusion and notes on future scope of these algorithms.
CHAPTER 2
LITERATURE SURVEY
2.1 Neighbour Counting Algorithm
The algorithm works in two phases: community discovery phase and the expanding
phase. Discovery is concerned with finding a group of entities that are members of a
community, while expanding seeks to identify the nature of a community given its
membership.
2.1.1 Community Discovery Methods
Members of natural groups in a network will tend to have a high density of
connections between them, with lower connectivity between different groups.
Discovering communities is typically viewed as a clustering problem, with specific
techniques being more applicable to social networks. A large class of methods
operates on a global scale, where every vertex is assigned to a single community. An
overview of these methods follows.
2.1.1.1 Graph Partition Techniques
Bisection techniques attempt to partition the network into two relatively separate
subgraphs. Several methods are effective at identifying a single bisection, but work
less well on graphs containing many distinct communities. An external decision must
be made to indicate when to stop bisecting, that is, how many communities exist
in the graph. Methods include:
Max Flow/Min Cut. These methods can produce good bisections, but make
no guarantees about keeping both groups of similar size. Flake et al. give a
min-cut algorithm based on min-cut trees which is able to produce an arbitrary
number of clusters, and can be expanded to produce a hierarchical clustering.
Spectral Bisection. Spectral bisection techniques partition a graph based on
the eigenvectors of its Laplacian. The Laplacian Q of a graph G is defined as
Q = D − A, where D is an n×n diagonal matrix with Dv,v = d(v) and A is the
adjacency matrix of G. The spectral bisection method finds the eigenvector
corresponding to the second smallest eigenvalue λ2 and bisects the graph according to
whether the eigenvector entry for a vertex is positive or negative. λ2 is also
called the algebraic connectivity of a graph. A smaller value indicates a better
split into two groups [5].
Kernighan-Lin Algorithm. This heuristic algorithm attempts to greedily
minimize the “external cost” of a partition, which is the sum of the cost of
inter-partition edges. It starts with an initial (possibly random) partition, and
determines the pair of vertices whose swap would produce the largest decrease
in cost. This gives a sequence of vertex swaps which is then scanned to find
the minimum. The procedure is then repeated with the new partition as the
starting point, until convergence on a local minimum is achieved.
2.1.1.2 Hierarchical Clustering
Hierarchical clustering techniques are driven by an application-specific similarity
measure between the groups of vertices of a network [Scott 2000]. Techniques
include:
Agglomerative: In this bottom-up approach, each vertex initially belongs to its
own cluster. Clusters are merged incrementally in order of increasing cost. In
single linkage clustering, the cost of merging two clusters depends upon the
closest vertex pair spanning them. In complete linkage clustering, the cost
depends on the most distant vertex pair spanning the clusters. Newman gives
an algorithm based on modularity Q. Given a partition of the vertices, define a
matrix e where eij is the fraction of edges in G between components i and j,
and let ai = Σj eij. Then Q is defined as

Q = Σi ( eii − ai² )
At each step choose to merge the two clusters that cause the greatest increase
in Q. Agglomerative clustering methods do not find peripheral members
reliably. An additional level of processing is needed to determine at which
level the hierarchy defines the most meaningful communities.
Divisive: In divisive hierarchical clustering, the entire graph G begins as one
cluster. Edges are removed to partition the cluster into smaller ones, as
opposed to agglomerative where clusters are joined to larger clusters. Girvan
and Newman gave an algorithm based on edge betweenness centrality. The
edge with the highest betweenness centrality is repeatedly removed from the graph
until no edges remain. Edge betweenness can be calculated in O(mn), giving a total
computation time of O(m²n).
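Newman's modularity can be computed directly from an edge list and a vertex-to-community map. The sketch below is illustrative code (the names are ours, not thesis code) using the equivalent per-community form Q = Σc ( Lc/m − (dc/2m)² ), where Lc is the number of edges internal to community c, dc its total degree, and m the number of edges in G:

```python
# Illustrative sketch of Newman's modularity (names are ours, not thesis
# code): Q = sum_c ( L_c/m - (d_c/2m)^2 ), where L_c counts edges internal
# to community c, d_c is the total degree of c, and m is the edge count.

from collections import defaultdict

def modularity(edges, community):
    m = len(edges)
    internal = defaultdict(int)   # L_c
    degree = defaultdict(int)     # d_c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

For this toy graph the natural split scores Q = 6/7 − 1/2 ≈ 0.357, while placing all six vertices in one cluster gives Q = 0, matching the intuition that the two triangles form good communities.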
Clauset et al. [2] state that hierarchical structure is actually a defining component of
social networks, sufficient to explain power-law degree distributions, high clustering
coefficients, and short path lengths (the small-world phenomenon). The hierarchical
random graph model is a dendrogram with probabilities at internal nodes. The
probability of an edge between two leaves is equal to the value at their lowest
common ancestor. This model produces networks exhibiting the properties of small-
world networks. They also give a statistics-based algorithm for inferring the most
likely hierarchical random graph model from a given network.
2.1.2 Expanding an existing community
The essential function of a community expansion method is to identify the
most promising next member to add to the community. This is achieved by assigning
a score to all entities in the network, and selecting the highest-scoring outside vertex
to join the community. Given below is a description of several different possible
scoring criteria to rank the selection:
Neighbour Count: The most obvious candidates for incorporation have many
neighbours in the community. Basketball players tend to be associated with
other basketball players, musicians with other musicians, etc.
Juxtaposition Count: One drawback of using a simple neighbour count
criterion is that each neighbour is given the same weight, regardless of the
strength of the relation. The edge weights defining the network are co-occurrence
frequencies of the given entity pair. Using such juxtaposition
weights assigns more importance to neighbours that are more frequently
associated in the text with in-community members.
Neighbour Ratio: A failing of such counting scores is that the status of
ubiquitous entities gets artificially elevated. A frequent entity like “George
Bush” has over a thousand neighbours in the graph, and hence will have
neighbours from many communities. Say six of these neighbours are chemists.
The raw neighbour count score would identify George Bush as more likely to
be a chemist than John Dalton, an entity that has only 8 neighbours (5 of
which are chemists). But if the vertex degree is factored in and a ratio used,
Dalton becomes promoted to the most likely chemist.
Juxtaposition Ratio: The bias to ubiquitous entities is also present in
juxtaposition counts. Edges to “George Bush” tend to have high weight,
simply because of the total frequency of the entity. Using a ratio helps control
for high-frequency vertices.
Binomial Probability: Using ratios has the problem of artificially elevating
the importance of infrequent entities. An entity with 100 neighbours, 60 of
which are chemists, would have a neighbour ratio of 0.6. But an entity with a
single neighbour who happened to be a chemist would have a ratio of 1.
Normalize for this by computing the probability Pr[n, k] that an entity would
happen to have at least k of its n neighbours from the group by chance:

Pr[n, k] = Σ (i = k to n) C(n, i) p^i (1 − p)^(n − i)

where p is the fraction of known-group members to network nodes. When
Pr[n, k] is extremely low for an observed k in-group neighbours, then it can be
reasoned that the entity must be a member of the community.
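The binomial criterion is a one-line upper-tail sum. A hypothetical helper (the function name and the example value of p are ours) might look like:

```python
# Hypothetical helper (names are ours) for the binomial criterion: the
# probability that at least k of an entity's n neighbours fall inside the
# group purely by chance, when a fraction p of all nodes are group members.

from math import comb

def pr_at_least(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

With an illustrative p = 0.1, an entity with a single in-group neighbour scores Pr[1, 1] = 0.1 (unremarkable), while 60 in-group neighbours out of 100 gives a vanishingly small probability, which is strong evidence of membership.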
Assume the community C, and the set of nodes adjacent to the community, B (each
has at least one neighbour in C). At each step, one or more nodes from B are chosen
and agglomerated into C. Then B is updated to include any newly discovered nodes.
This continues until an appropriate stopping criterion is satisfied. When the
algorithms begin, C = {s} and B contains the neighbours of s: B = {n(s)}.
The Clauset algorithm [2] focuses on nodes inside C that form a “border” with B:
each has at least one neighbour in B. Denoting this set Cborder and focusing on
incident edges, Clauset defines the following local modularity:

R = ( Σij βij [i ∈ C and j ∈ C] ) / ( Σij βij )

where βij is the adjacency matrix comprising only those edges with one or more
endpoints in Cborder and [P] = 1 if proposition P is true, and zero otherwise. Each
node in B that can be agglomerated into C will cause a change in R, ∆R, which may
be computed efficiently. At each step, the node with the largest ∆R is agglomerated.
This modularity R lies on the interval 0 ≤ R ≤ 1 (defining R = 1 when |Cborder| = 0) and
local maxima indicate good community separation. For a network of average degree
d, the cost to agglomerate |C| nodes is O(|C|²d) [6].
The LWP algorithm defines a different local modularity, which is closely related to the
idea of a weak community [9]. Define the number of edges internal and external to C
as Min and Mout, respectively:

Min = (1/2) Σij Aij [i ∈ C][j ∈ C],    Mout = Σij Aij [i ∈ C][j ∉ C]

The LWP local modularity Mf is then:

Mf = Min / Mout

When Mf > 1/2, C is a weak community. The algorithm consists of
agglomerating every node in B that would cause an increase in Mf, ∆Mf > 0, then
removing every node from C that would also lead to ∆Mf > 0 so long as the node’s
removal does not disconnect the subgraph induced by C. (Removed nodes are not
returned to B, they are never re-agglomerated.) Finally B is updated and the process
repeats until a step where the net number of agglomerations is zero. The algorithm
returns a community if Mf > 1 and s ∈ C. Similar to the Clauset method [2], the cost
of agglomerating |C| nodes is O(|C|^2d).
A number of approaches evaluate nodes based on the number of neighbours
they have in and out of the community, adding nodes to the community when they
optimize a function of a specific quantity. Bagrow [1] did this for a measure called
outwardness, defined as the degree-normalized difference between neighbors inside
and outside the community.
The “outwardness” Ωv(C) of node v ∈ B from community C is:

Ωv(C) = (1/kv) Σ (j ∈ n(v)) ( [j ∉ C] − [j ∈ C] ) = (kv^out − kv^in) / kv

where n(v) are the neighbours of v. In other words, the outwardness of a node is the
number of neighbours outside the community minus the number inside, normalized by
the degree. Thus, Ωv has a minimum value of −1 if all neighbours of v are inside C,
and a maximum value of 1 − 2/kv, since any v ∈ B must have at least one neighbour
in C. Since finding a community corresponds to maximizing its internal edges while
minimizing external ones, the node with the smallest Ω is agglomerated at each step,
breaking ties at random.
Figure 2.1 a: The community C is surrounded by a boundary of explored nodes B.
This exploration implies an additional layer of nodes that are known only due to their
adjacencies with B.
Figure 2.1 b: Two nodes i and j in B, with Ωi = 2/3 and Ωj = −1. Moving node j into C
will give improved community structure, compared to moving i.
Figure 2.1(a) and (b): Neighbour counting explanation.
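As a small illustrative sketch (not the thesis code), the outwardness of a node adjacent to C follows directly from its neighbour counts:

```python
# Illustrative sketch (not thesis code) of Bagrow's outwardness:
# Omega_v(C) = (k_v_out - k_v_in) / k_v for a node v adjacent to C.

from collections import defaultdict

def outwardness(edges, C, v):
    C = set(C)
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    k_out = sum(1 for u in neighbours[v] if u not in C)
    k_in = len(neighbours[v]) - k_out
    return (k_out - k_in) / len(neighbours[v])

# Two triangles joined by a bridge; C is the left triangle {0, 1, 2}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
```

For this graph with C = {0, 1, 2}, node 3 has two neighbours outside C and one inside, so its outwardness is (2 − 1)/3 = 1/3; a node whose neighbours all lie inside C attains the minimum value −1 and would be agglomerated first.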
2.2 Greedy Algorithm
Algorithm 1: Greedy algorithm for maximising modularity
Input: graph G = (V, E)
Output: clustering C of G

C ← singletons
initialize matrix ∆
while |C| > 1 do
    find {i, j} such that ∆i,j is the maximum entry in ∆
    merge clusters i and j
    update ∆
return clustering with highest modularity
The greedy algorithm starts with the singleton clustering and iteratively merges the
two clusters that yield a clustering with the best modularity, i.e., the merge with the
largest increase or the smallest decrease is chosen. After n−1 merges, the clustering
that achieved the highest modularity is returned. The algorithm maintains a symmetric
matrix ∆ with entries ∆i,j := q(Ci,j) − q(C), where C is the current clustering and Ci,j
is obtained from C by merging clusters Ci and Cj. Note that there can be several pairs
i and j such that ∆i,j is the maximum; in these cases the algorithm selects an arbitrary
pair. The pseudo-code for the greedy algorithm is given in Algorithm 1.
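A naive rendering of Algorithm 1 in Python can make the loop concrete. This is our own illustrative sketch, not the thesis code: it recomputes modularity for every candidate merge instead of maintaining the matrix ∆, which is far less efficient but easy to verify on small graphs.

```python
# Naive rendering of Algorithm 1 (illustrative sketch, not thesis code):
# recomputes modularity for every candidate merge instead of maintaining
# the matrix of gains.

from collections import defaultdict

def modularity(edges, community):
    # Q = sum_c ( L_c/m - (d_c/2m)^2 )
    m = len(edges)
    internal, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def greedy_modularity(edges):
    nodes = sorted({v for e in edges for v in e})
    community = {v: v for v in nodes}              # singleton clustering
    best = (modularity(edges, community), dict(community))
    while len(set(community.values())) > 1:
        clusters = sorted(set(community.values()))
        # choose the merge {i, j} with the largest modularity gain
        i, j = max(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda ab: modularity(
                       edges, {v: ab[0] if c == ab[1] else c
                               for v, c in community.items()}))
        community = {v: i if c == j else c for v, c in community.items()}
        q = modularity(edges, community)
        if q > best[0]:
            best = (q, dict(community))
    return best

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q, clustering = greedy_modularity(edges)
```

On this toy graph the greedy procedure recovers the two triangles as the highest-modularity clustering before the final merge drives Q down to 0.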
An efficient implementation using sophisticated data structures requires O(n² log n)
runtime [1]. Note that n−1 iterations is an upper bound, and one can terminate the
algorithm when the matrix ∆ contains only non-positive entries. This property is
called single-peakedness. Since it is NP-hard to maximize modularity [1] in general
graphs, it is unlikely that this greedy algorithm is optimal. In fact, there is a family of
graphs on which the above greedy algorithm has an approximation factor of 2,
asymptotically. Furthermore, there are instances where a specific way of breaking ties
between merges yields a clustering with modularity 0, while the optimum clustering
has a strictly positive score. Modularity is defined such that it takes values in the
interval [−1/2, 1] for any graph and any clustering. In particular, the modularity of a
trivial clustering placing all vertices into a single cluster has a value of 0. This
technical peculiarity shows that the greedy algorithm has an unbounded
approximation ratio.
2.3 Drawbacks of Greedy Algorithm
- Asymptotic growth of the metric's value implies a strong dependence on the size of
the network and the number of modules the network contains [6].
- The resolution limit is a problem where communities below a certain size are merged
into larger ones [6]. A classic example where modularity cannot identify communities
of small size is a cycle of m cliques: here, maximum modularity is obtained when two
neighbouring cliques are merged [4].
- Degeneracy of solutions is a problem where a community scoring function (e.g.
modularity) admits multiple distinct high-scoring solutions and typically lacks a clear
global maximum, thereby resorting to tie-breaking [6].
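The resolution-limit example above can be checked numerically. The sketch below (the helper `ring_of_triangles` is hypothetical, not code from the thesis) builds a cycle of m triangles joined by single edges and compares the modularity of the natural "one community per clique" clustering against the clustering that merges neighbouring cliques in pairs; for m > 8 the merged clustering scores higher even though the cliques are the natural communities.

```python
def modularity(adj, clusters):
    """q(C) = sum over clusters of (intra-edges / m - (degree-sum / 2m)^2)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for c in clusters:
        intra = sum(1 for u in c for v in adj[u] if v in c) / 2
        deg = sum(len(adj[u]) for u in c)
        q += intra / m - (deg / (2 * m)) ** 2
    return q

def ring_of_triangles(m):
    """m triangles (3-cliques), each joined to the next by a single edge."""
    adj = {v: set() for v in range(3 * m)}
    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)
    for i in range(m):
        a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
        link(a, b), link(b, c), link(a, c)
        link(c, (3 * (i + 1)) % (3 * m))   # bridge to the next triangle
    return adj

m = 30                                     # any even m > 8 shows the effect
adj = ring_of_triangles(m)
per_clique = [set(range(3 * i, 3 * i + 3)) for i in range(m)]
paired = [per_clique[2 * i] | per_clique[2 * i + 1] for i in range(m // 2)]
q_single = modularity(adj, per_clique)     # equals 3/4 - 1/m
q_paired = modularity(adj, paired)         # equals 7/8 - 2/m
```

Since 7/8 − 2/m exceeds 3/4 − 1/m exactly when m > 8, modularity maximization prefers to merge neighbouring cliques on large rings.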
2.4 Drawbacks of PageRank Algorithm
Modularity is a property of a network division that measures when the division is good,
in the sense that there are many edges within the communities and only a few between
them. In modularity-based algorithms, each node of the graph is initially considered an
individual community, and communities are joined iteratively based on the increase in
modularity caused by their joining; the pair producing the maximum change in
modularity is joined. There are a few drawbacks associated with modularity-based
methods: they require information about the entire structure of the network, which is
not feasible to determine for vast real-world networks, and modularity optimization
methods are unable to detect overlapping communities [1]. To detect overlapping
communities, clique percolation can be used. Clique percolation is based on the
assumption that a community consists of fully connected subgraphs, and it detects
overlapping communities by searching for adjacent cliques. However, it is a hard
method to implement well, owing to the difficulty of producing intermediate
representations [4] of percolating structures.
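The clique-percolation idea can be sketched as follows. This is an illustrative toy implementation for k = 3 (not the thesis's code): it enumerates k-cliques by brute force, which is exponential and only suitable for very small graphs, then unions cliques that share k − 1 nodes.

```python
from itertools import combinations

def clique_percolation(adj, k=3):
    """k-clique percolation sketch: a community is the union of k-cliques
    reachable from one another via k-cliques sharing k-1 nodes."""
    # brute-force enumeration of k-cliques (toy graphs only)
    cliques = [frozenset(c) for c in combinations(adj, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # union-find over cliques that overlap in k-1 nodes
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return list(groups.values())

# two triangles overlapping in node 2: the detected communities overlap
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
communities = clique_percolation(adj, k=3)
```

Note that node 2 belongs to both detected communities, which is exactly the overlapping behaviour that modularity optimization cannot express.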
CHAPTER 3
IMPLEMENTATION
3.1 Finding potential members
The main aim is to find all the potential members that can be added to the community.
The algorithm considers all the neighbours of the given community; all neighbours are
extracted into the ADJ[] array. The algorithm grows dynamically: the member with the
highest PageRank is added to the community, a fresh set of neighbours is extracted into
the ADJ[] array taking the newly added member into account, and the same process is
repeated. A flag array CHECK_ARR[] is used to check whether a neighbouring member
has the same set of interests as the community members.
Figure 3.1(a): Finding potential neighbour step one
Figure 3.1(b): Finding potential neighbour step two
1N : 1-hop neighbour
2N: 2-hop neighbour
C: Community
Figures 3.1(a) and 3.1(b) are snapshots of a very large graph (the input webgraph). The
initial set of neighbours is (P, Q, R, S). Assume that PageRank(R) is the highest among
all the neighbours; node R is therefore added to the community. Next, the new list of
1N (one-hop) neighbours is considered, that is, neighbours (P, Q, T, S). The algorithm
thus proceeds dynamically. At every step it checks whether the neighbour has the same
set of interests as the community: if yes, the neighbour node is added to the ADJ[]
array; otherwise the algorithm passes on to the next-hop neighbour.
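The expansion loop described above can be sketched in Python. The graph, the PageRank values, and the interest labels below are hypothetical stand-ins for the input webgraph, the PageRank computation of Section 3.2, and the CHECK_ARR[] test; this is a sketch, not the system's implementation.

```python
def expand_community(adj, pagerank, interests, seed, target, max_steps=10):
    """Greedy seed-set expansion: repeatedly add the highest-PageRank
    neighbour of the community that passes the interest check."""
    community = set(seed)
    for _ in range(max_steps):
        # ADJ[]: all neighbours of the current community members
        frontier = {n for v in community for n in adj[v]} - community
        # CHECK_ARR[]: keep only neighbours sharing the community's interest
        frontier = {n for n in frontier if interests.get(n) == target}
        if not frontier:
            break
        community.add(max(frontier, key=lambda n: pagerank[n]))
    return community

# hypothetical toy graph with precomputed PageRank and interest labels
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
pagerank = {0: 0.30, 1: 0.20, 2: 0.15, 3: 0.25, 4: 0.10}
interests = {0: 'x', 1: 'x', 2: 'x', 3: 'x', 4: 'y'}
expanded = expand_community(adj, pagerank, interests, {0}, 'x')
```

Here node 4 is reachable but carries a different interest label, so the loop stops once every remaining neighbour fails the check.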
Figure 3.2 depicts a flowchart that shows how the expansion algorithm executes to
find potential members and add them to the community.
Figure 3.2: Flowchart depicting the algorithm
3.2 Improved PageRank
The proposed algorithm is based on the mean value of the PageRanks of all web pages
and offers performance advantages over the traditional PageRank algorithm. It is a
novel approach for reducing the number of iterations the PageRank algorithm performs
before reaching its convergence point:
1. Initially assume the PageRank of every web page to be some value, say 1.
2. Calculate the PageRank of every page by the following formula:
   PR(A) = 0.15/N + 0.85 (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
   o T1 through Tn are the pages providing incoming links to page A
   o PR(Ti) is the PageRank of page Ti
   o C(Ti) is the total number of outgoing links on Ti
   o N is the total number of nodes in the graph
3. Calculate the mean value of all PageRanks:
   o mean = (sum of the PageRanks of all web pages) / (number of web pages)
4. Normalize the PageRank of each page:
   o Norm PR(A) = PR(A) / mean, where Norm PR(A) is the normalized PageRank of
     page A and PR(A) is the PageRank of page A
   o Assign PR(A) = Norm PR(A)
5. Repeat steps 2 to 4 until the PageRank values of two consecutive iterations are the
   same.
   o The pages with the highest PageRank are the most significant pages.
When running the original PageRank algorithm, the individual PageRank values of
web pages keep oscillating about their final values. Saturation is reached after a number
of iterations, when each value converges according to a convergence factor.
In the proposed improved PageRank algorithm, this oscillation is minimized by
normalizing the PageRank values after every iteration, thereby bringing the current
value closer to the saturation point each time. The procedure for the proposed improved
PageRank algorithm is depicted as a flowchart in figure 3.3.
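The normalization step can be sketched as follows. This is an illustrative implementation under two assumptions (it is not the exact code used in the experiments): the graph is given as out-link adjacency sets, and every node has at least one out-link so no division by zero occurs.

```python
def improved_pagerank(adj, d=0.85, tol=1e-8, max_iter=1000):
    """Mean-normalized PageRank sketch. `adj` maps node -> set of out-links;
    every node is assumed to have at least one out-link."""
    n = len(adj)
    incoming = {v: [u for u in adj if v in adj[u]] for v in adj}
    pr = {v: 1.0 for v in adj}                       # step 1: start at 1
    for it in range(1, max_iter + 1):
        new = {v: (1 - d) / n                        # step 2: PR formula
                  + d * sum(pr[u] / len(adj[u]) for u in incoming[v])
               for v in adj}
        mean = sum(new.values()) / n                 # step 3: mean rank
        new = {v: r / mean for v, r in new.items()}  # step 4: normalize
        if max(abs(new[v] - pr[v]) for v in adj) < tol:
            return new, it                           # step 5: converged
        pr = new
    return pr, max_iter

# hypothetical 3-node graph: 0 -> 1, 2   1 -> 2   2 -> 0
ranks, iters = improved_pagerank({0: {1, 2}, 1: {2}, 2: {0}})
```

Because every iterate is rescaled to mean 1, the returned values sit around 1 rather than summing to 1 as in the traditional formulation; the relative ordering of pages is what matters for choosing community members.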
Figure 3.3: Flowchart depicting Improved PageRank Algorithm
Input:
o A web graph with 50000 vertices (numbered 0 to 49999) and 50000 edges.
o The nodes that are already part of the community (around 100 nodes).
Figure 3.4: Input webgraph (50000 vertices and 50000 edges)
Figure 3.5: Existing community members
Output:
o The PageRank of every node, calculated using the improved PageRank algorithm.
o The potential members that can be added to the community, determined using the
  PageRank values obtained above.
Figure 3.6: Calculating PageRank using the existing code.
Figure 3.7: Calculating PageRank using improved algorithm.
Figure 3.8: Using the PageRank values obtained above, the potential members that can
be added to the community are determined.
c) Performance improvement graphs: traditional PageRank vs. improved PageRank
Figure 3.9: Graph comparing PageRank and improved PageRank (x-axis: number of
nodes, in millions; y-axis: number of iterations to find PageRank).
Figure 4.1 shows the process of employing the neighbour-counting algorithm to expand
a given community; the modularity of the new community is also calculated. The next
possible addition to the community is chosen by the property of outwardness (Ω) [1]:
the immediate neighbour with the least outwardness is the strongest candidate for the
community.
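The selection rule can be computed directly from the definition of outwardness. A minimal sketch, with a hypothetical boundary node and community:

```python
def outwardness(adj, community, v):
    """Omega(v) = (neighbours outside C - neighbours inside C) / degree(v)."""
    outside = sum(1 for u in adj[v] if u not in community)
    inside = len(adj[v]) - outside
    return (outside - inside) / len(adj[v])

# hypothetical boundary node 2 with neighbours {0, 1, 3}
adj = {0: {2}, 1: {2}, 2: {0, 1, 3}, 3: {2}}
omega = outwardness(adj, {0, 1}, 2)   # 1 outside, 2 inside -> -1/3
```

When all of a node's neighbours lie inside the community, the value reaches its minimum of −1, matching the Ωj = −1 case of Figure 2.1(b).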
4.2 Greedy algorithm
Step 1: Q = 0.1152; O(5) = 0.75, O(6) = 0.25
Step 2: Q = 0.1367; O(6) = 0.20, O(7) = 0.00, O(8) = 0.40
Step 3: Q = 0.1758; O(6) = 0.167, O(8) = 0.00
Step 4: Q = 0.2207; O(6) = −0.143
Step 5: Q = 0.2871
Figure 4.2: Greedy algorithm steps
The community is expanded using the greedy algorithm; the expanded community at
each iteration is depicted in figure 4.2. Whether to add a member to the community is
decided based on the modularity value of the resulting subgraph. The outwardness (O)
of the candidate nodes is listed at each step.
4.3 PageRank Algorithm
Step 1: Q = 0.1152; O(5) = 0.75, O(6) = 0.25
Step 2: Q = 0.1308; O(5) = 0.20, O(8) = 0.00
Step 3: Q = 0.1309; O(5) = −0.167, O(7) = 0.00
Step 4: Q = 0.2480; O(5) = −0.4285
Step 5: Q = 0.2871
Figure 4.3: Proposed improved PageRank algorithm steps
The network shown in figure 4.3 depicts the expansion process using the proposed
improved PageRank algorithm.
4.4 Comparing performances of PageRank and the proposed Improved
PageRank
The existing PageRank algorithm and the proposed improved PageRank algorithm
were executed to find the PageRank of all the nodes in the previous graph. Each graph
from figure 4.4(a) to 4.4(e) represents one iteration of the process.
Figure 4.4(a): Graph showing iteration 1 PageRank values
Figure 4.4(b): Graph showing iteration 2 PageRank values
Figure 4.4(e): Graph showing iteration 5 PageRank values
Figures 4.5 and 4.6 compare the three algorithms, which were tested on two different
characteristics: modularity and outwardness.
Figure 4.5: Modularity – comparison between the three algorithms.
Figure 4.6: Outwardness – comparison between the three algorithms.
4.5 Inference
Modularity is found to increase continually for all three algorithms. Since the greedy
structural optimization method adds members to the group so as to maximize
modularity, it shows better results than neighbour counting. Neighbour counting
proceeds by adding the node with the least outwardness value, and hence the resulting
community has lower outwardness than those of greedy structural optimization and
the PageRank algorithm.
Finally, in the longer run the PageRank algorithm produces a community with higher
modularity and outwardness comparable to both greedy structural optimization and the
neighbour-counting algorithm. This makes the final result more efficient, letting the
system add stronger candidate neighbours to the given community.
The proposed improved PageRank algorithm reduces the number of iterations taken to
reach saturation compared with the traditional PageRank algorithm. This reduces the
time taken to propose new members for the community, thereby improving the
efficiency of expanding a given community using seed sets. The effect is more
pronounced when the dataset is of the order of one lakh (100,000) nodes.
CONCLUSION AND FUTURE WORK
The seed set expansion problem has its roots in a number of overlapping areas,
including the problem of identifying central nodes in social networks [3] and finding
related and/or important Web pages from an initial set of query results [3]. In
particular, the PageRank algorithm broadened from its initial focus on Web search [9]
to also include methods for finding nodes “similar” to an initial root, by starting short
random walks from the root and seeing which other nodes were likely to be reached
[3].
The seed set expansion problem has been gaining visibility as a general-purpose
framework for identifying members of a networked community from a small set of
initial examples. But subtle trade-offs in the formulation and the underlying methods
can significantly affect how this process works, and in this project several such
principles have been identified concerning the relative power of different expansion
heuristics and the structural properties of the initial seed set. The investigations have
involved analyses of datasets across diverse domains as well as theoretical trade-offs
between different problem formulations. There are a number of interesting directions
for further work.
In particular, the power of PageRank-based methods raises the question of whether
these are indeed the "right" algorithms for seed set expansion, or whether they should
be viewed as proxies for a richer set of probabilistic approaches that could yield strong
performance. Second, the damping factor, assumed here to be a constant, can be
varied; over different seed sets, varying the damping factor could lead to anomalies
and special cases that need to be studied carefully, and a richer understanding of the
seed sets that lead to the most effective expansions to a larger community could provide
useful insights for the application of these methods. And finally, as noted earlier, nodes
in a network tend to belong to multiple communities simultaneously, and a robust way
of expanding several overlapping communities together is a natural question for further
study.
REFERENCES
[1] James P. Bagrow. "Evaluating local community methods in networks". Journal of
Statistical Mechanics: Theory and Experiment, pages 15-19, 2008.
[2] Aaron Clauset. "Finding local community structure in networks". Physical Review
E, 72(2):026132, 2005.
[3] Isabel M. Kloumann and Jon M. Kleinberg. "Community membership
identification from small seed sets". In KDD, 2014.
[4] Andrew Mehler and Steven Skiena. "Expanding network communities from
representative examples". ACM Transactions on Knowledge Discovery from Data
(TKDD), pages 14-19, 2009.
[5] Jaewon Yang and Jure Leskovec. "Defining and evaluating network communities
based on ground-truth". In MDS '12, page 3. ACM, 2012.
[6] J. Leskovec, K. J. Lang, and M. Mahoney. "Empirical comparison of algorithms
for network community detection". In WWW, pages 631-640, New York, USA, 2010.
[7] Reid Andersen, Fan Chung, and Kevin Lang. "Local graph partitioning using
PageRank vectors". In Foundations of Computer Science, pages 475-486, 2006.
[8] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. "You
are who you know: inferring user profiles in online social networks". In WSDM '10,
pages 251-260. ACM, 2010.
[9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The
PageRank citation ranking: Bringing order to the web". Technical report, Stanford
InfoLab, 1999.
BIBLIOGRAPHY
[1] Community Detection in Social Media – www.slideshare.com
[2] Empirical Comparison of Algorithms in Network Community Detection – dl.acm.org
[3] Louvain Method for Community Detection – perso.uclouvain.be
[4] Applications of Community Detection – royalsocietypublishing.org
[5] IEEE Xplore – ieeexplore.ieee.org
[6] Wikipedia – wikipedia.org
[7] SNAP – Stanford Database Collection
APPENDIX B
GLOSSARY OF TERMS
[1] Conductance
In graph theory the conductance of a graph G=(V,E) measures how "well-knit" the
graph is: it controls how fast a random walk on G converges to a uniform distribution.
The conductance of a graph is often called the Cheeger constant of a graph as the
analog of its counterpart in spectral geometry [8].
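The definition can be sketched directly in code (the example graph is hypothetical):

```python
def conductance(adj, S):
    """phi(S) = cut(S, V-S) / min(vol(S), vol(V-S)),
    where vol(X) is the sum of the degrees of the nodes in X."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_s = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_s
    return cut / min(vol_s, vol_rest)

# hypothetical graph: two triangles joined by a single bridge edge
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi = conductance(adj, {0, 1, 2})   # cut = 1, vol = 7 on each side
```

A low conductance (here 1/7) indicates a well-separated community: few edges leave the set relative to its volume.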
[2] Modularity
Modularity is one measure of the structure of networks or graphs. It was designed to
measure the strength of division of a network into modules (also called groups,
clusters or communities). Networks with high modularity have dense connections
between the nodes within modules but sparse connections between nodes in different
modules. Modularity is often used in optimization methods for detecting community
structure in networks.
[3] Ground-Truth Community
Generally, after communities are identified in a given network, the essential next step
is to interpret them by identifying a common external property [5] that all the members
share and around which the community organizes. Thus, the goal of network
community detection is to identify sets of nodes with a common (often
external/latent/unobserved) property based only on the network connectivity structure.
A "common property" can be a common attribute, affiliation, role, or function.
A distinction is made between network communities and groups. A community is
defined structurally (i.e., a set of nodes extracted by the community detection
algorithm), while a group is defined based on nodes sharing a property around which
they organize in the network (e.g., belonging to a common interest-based group, or
sharing a common affiliation).
Using ground-truth communities allows for quantitative and large-scale evaluation [5]
and comparison of different community detection methods. This represents a
significant step forward, as the field can move beyond the current standard of anecdotal
evaluation of communities to comprehensive evaluation of the performance of
community detection methods. Ground-truth communities are structurally most similar
to the communities discovered by the random-walk method [5].
[4] Ego Networks
Ego networks consist of a focal node ("ego") and the nodes to which the ego is directly
connected (called "alters"), plus the ties, if any, among the alters. Of course, each alter
in an ego network has his or her own ego network, and all ego networks interlock to
form the human social network. The denser the ties in an ego network, the stronger the
ties, and the more insular and homogeneous the ego network.
Typical measures:
- Homophily
- Size
- Average strength of ties
- Heterogeneity
- Density
- Composition (e.g., % women, % whites, etc.)
- Range: substantively defined as potential access to social resources, often measured
as the diversity of alters (following the weak-ties argument); density is thought of as
an inverse measure of range, and size and heterogeneity are also seen as measures of
range.
[5] Outwardness
The outwardness of a node is the number of its neighbours outside the community
minus the number inside, normalized by its degree [6]. Thus Ωv has a minimum value
of −1 if all neighbours of v are inside C, and a maximum value of 1 − 2/kv, since any
v ∈ B must have at least one neighbour in C [4]. Since finding a community
corresponds to maximizing its internal edges while minimizing external ones, the
algorithm agglomerates the node with the smallest Ω at each step, breaking ties at
random.