Productive Team Formation (Social Networks Analysis)

ABSTRACT
Recently, there has been tremendous interest in social network analysis. A social network is a social structure composed of a set of social actors and a complex set of interactions between them. The co-authorship network is a well-known example of a social network. Most previous studies consider the co-authorship relation between two or more authors as a collaboration. Co-authorship networks have been studied extensively from various perspectives, such as degree distribution analysis, social community extraction and social entity ranking.
An interesting problem on co-authorship networks is the formation of a productive team for a new research lab. The static version of the problem is relatively well studied: it involves hub identification followed by team formation using various combinatorial algorithms. A more interesting variant takes into account the time dimension together with the constraints of a fixed hiring budget and a fixed team size, such that the accumulated future productivity of the team is maximized over the research community. Productivity in a research community can be quantified in terms of various quantities, e.g. research volume and collaboration diversity within the community.
Results of experiments on a large co-authorship network with 2 million collaborations among 0.6 million collaborators suggest that a good extent of information about future productivity can be extracted from the present network topology.


  1. Master DMKM Report: Productive Team Formation. Muhammad AHSAN. Defended on 09/07/2013.
  2. “Scientific research consists in seeing what everyone else has seen, but thinking what no one else has thought.” – Unknown
  3. Supervisors: Berkant Barla Cambazoglu (Yahoo!), Jean-Gabriel Ganascia (University Pierre et Marie Curie)
  4. Contents
     Abstract
     Acknowledgments
     1 Introduction
       1.1 Social Networks
       1.2 Co-authorship Network
       1.3 Report Organization
       1.4 Research Domain
     2 Related Work
       2.1 Graph Mining
       2.2 Team Formation
     3 Problem Formalization
       3.1 Researcher's Productivity
       3.2 Productive Team Formation
     4 Experimental Dataset
       4.1 Dataset Description
         4.1.1 Dataset Name
         4.1.2 Dataset Source
         4.1.3 Dataset Purpose
         4.1.4 Dataset Format
       4.2 Dataset Statistics
         4.2.1 Publication dates
         4.2.2 Publication volume by type
         4.2.3 Average authors per paper
         4.2.4 Publications per Year
         4.2.5 Authors per publication
         4.2.6 Publications per author
         4.2.7 Records in DBLP (grouped by year)
  5. Contents (continued)
     5 Experimental Setup
       5.1 Layered Approach
         5.1.1 Dataset Preprocessing
       5.2 Features Selection
         5.2.1 Cost Estimation Model
         5.2.2 Link Prediction Problem
       5.3 Supervised Machine Learning
         5.3.1 Train, Validate and Test Sets
         5.3.2 Regression Problem
       5.4 Team Formation
         5.4.1 0-1 Knapsack Problem
     6 Experimental Results & Conclusion
       6.1 Machine Learning Performance
         6.1.1 Regression Models
         6.1.2 Baseline for performance
         6.1.3 Results Evaluation
       6.2 Team Selection Performance
         6.2.1 Baseline for performance
         6.2.2 Results Evaluation
     7 Future Work
       7.1 Scientific Contribution
       7.2 Problem Extensions
       7.3 Technical Achievements
     Bibliography
  8. Acknowledgments. I would like to express special gratitude to my thesis supervisor at Yahoo!, Dr. Berkant Barla Cambazoglu, whose stimulating suggestions and encouragement helped me coordinate my thesis, especially in writing this report. Special thanks to my supervisors from DMKM, Prof. Jean-Gabriel Ganascia and Prof. Lorenza Saitta. Furthermore, I would like to acknowledge with much appreciation the crucial role of Dr. Amin Mantrach at Yahoo!, who was always willing to help with valuable suggestions and critique. I also acknowledge the guidance of the DMKM evaluation panel and their comments and advice. I thank my family for their continuous support and encouragement through their best wishes. I would also like to mention my friends at Yahoo!, who made my internship a memorable one.
  9. Host Institution. Yahoo! Labs Barcelona, the host institution, specializes in research on web search and data mining. The lab works in close collaboration with the European research community. Yahoo! Inc. is a multinational internet corporation headquartered in Sunnyvale, California. It is widely known for its web portal, its search engine Yahoo! Search, and related services including Yahoo! Directory, Yahoo! Mail, Yahoo! News, Yahoo! Finance, Yahoo! Groups, Yahoo! Answers, Flickr and many others. Yahoo! Research is responsible for inventing new science ideally suited to Yahoo!'s needs. This includes importing new ideas and implementing them for real-world use. These innovations have fueled scores of scientific advances for Yahoo!'s products and businesses. Some of its research areas are computational advertising, human-computer interaction, machine learning, systems research and web mining.
  10. 1 Introduction

     1.1 Social Networks. A social network [1] is a social structure composed of a set of social actors and a complex set of interactions between them. SNA (Social Network Analysis) is an inherently interdisciplinary academic field with origins both in social science and in the broader fields of network analysis and graph theory. Network analysis concerns itself with the formulation and solution of problems that have a network structure. Graph theory provides a set of abstract concepts and methods for the analysis of graphs. These, in combination with other analytical tools, form the basis of SNA methods. SNA is not just a methodology; it is a unique perspective on how society functions. Instead of focusing on individuals and their attributes, or on macroscopic social structures, it centers on the relations between individuals, groups, or social institutions. Studying society from a network perspective means studying individuals as embedded in a network of relations and seeking explanations for social behavior in the structure of these networks rather than in the individuals alone. This network perspective has become the focal point of many research studies. SNA has a long history in social science, although much of the work in advancing its methods has also come from mathematicians, physicists, biologists and computer scientists. The idea that networks of relations are important in social science is not new, but the widespread availability of data and advances in computing and methodology have made it much easier to apply SNA to a range of problems. Businesses use SNA to analyze and improve communication flow within their networks of partners and customers. Law enforcement agencies use SNA to identify criminal and terrorist networks from traces of communication and then identify key players in these networks. Social network sites like Facebook use basic elements of SNA to identify and recommend potential friends based on friends-of-friends. Civil society organizations use SNA methods to uncover conflicts of interest in hidden connections between government bodies, lobbies and businesses. Network operators use SNA methods to optimize the structure and capacity of their networks. As part of the recent surge of research on large, complex networks and their properties, considerable attention has been devoted to the computational analysis of social networks: structures whose nodes are embedded in a social context and whose edges represent interaction or collaboration between nodes. Both nodes and edges can be defined in different ways depending on the domain of interest.
  11. 1.2 Co-authorship Network. The co-authorship network is an important type of social network. A co-authorship relation between two or more authors is known as a collaboration. Co-authorship networks have been studied extensively from various perspectives such as degree distribution analysis, social community extraction and social entity ranking [2]. The availability of large, detailed datasets encoding such networks has stimulated extensive study of their basic properties and the identification of recurring structural features [3, 4]. Social networks are usually self-organizing, emergent and complex: a globally coherent pattern appears from the local interaction of the elements that make up the network. These patterns become more apparent as network size increases. Understanding the mechanisms by which they evolve is a fundamental question and forms the motivation for this research work, which studies an interesting variant of the computational problem underlying social network evolution and team formation. Team formation within a social network is an interesting problem. The static version of the problem is relatively well studied and involves hub identification and team selection using various combinatorial algorithms. A more interesting variant of this problem takes the time dimension into account. From the perspective of co-authorship networks, a team of researchers is to be formed for a new research lab. Each researcher is associated with a hiring cost; the budget for hiring the team and the team size are fixed. The task is to build a team out of the available pool of researchers in the co-authorship network in such a way that the accumulated productivity of the team is maximized at a future point in time, e.g. 5 years after the formation of the research lab. Productivity quantification depends on the use case; e.g. productivity in the scientific research community can be quantified using metrics like the volume of publications and the collaboration diversity of each researcher. The problem is composed of two phases: the first is predicting future productivity using various current graph features, and the second is team formation through combinatorial optimization using the resulting productivity prediction models. The critical challenge of the first phase is to determine to what extent the evolution of a social network can be modeled using features intrinsic to the network. In the case of a co-authorship network, there are many network-exogenous reasons why two researchers who have never collaborated before will do so in the next few years; e.g. they may happen to become geographically close when one of them changes organization. Such external collaborations can be hard to predict. Nevertheless, a large number of new collaborations are hinted at by the topology of the network; e.g. two authors who are close in the network via a common colleague will travel in similar circles and hence are more likely to collaborate in the near future [5]. Making this intuitive notion precise and understanding measures of "proximity" in the network leads to more accurate predictions. A number of proximity measures lead to predictions that outperform chance, indicating that the network topology does indeed contain latent information from which to infer future interactions.

  12. Once the prediction model is ready, the team selection heuristic can use it to select the researchers with maximum productivity per unit cost.

     1.3 Report Organization. This report is organized into seven chapters. The first chapter presents an intuitive introduction to the problem. The second chapter surveys related work. The third chapter presents the problem formalization in mathematical terms. The fourth chapter presents an analysis of the co-authorship dataset used for experimentation. The fifth chapter discusses the experimental setup and process design. The sixth chapter discusses the results of the experimentation. The seventh chapter provides conclusions, analysis, and possible future extensions of the problem.

     1.4 Research Domain. The relevant research domains for this work are: 1. Graph Mining, 2. Combinatorial Optimization. The major focus of this work is on graph mining. The methods and techniques are assessed empirically through different experiments, and the results confirm the effectiveness of the applied techniques.
  13. 2 Related Work

     2.1 Graph Mining. The research topic of social network analysis is not new and has been the focus of many scholars in anthropology and psychology for decades. Detecting communities in a social network structure has also been pursued by sociologists and, more recently, by physicists and applied mathematicians, with applications to social and biological networks [6, 7]. The idea behind viral marketing [8] is that by targeting the most influential or central users in the network, we can activate a chain reaction of influence driven by word-of-mouth, so that with a very small marketing cost we can actually reach a very large portion of the network. The problem of finding hubs or celebrities [5] in a social network is a basic and well-studied problem. Liben-Nowell and Kleinberg [5] introduce the link prediction problem and show that simple graph-theoretic measures, such as the number of common neighbors, are sufficient to efficiently detect links that are likely to appear in a social network. However, most research work does not focus on the time dimension, even though the centrality of any node in a network keeps changing [9, 10]. The centrality of nodes in a community measures the relative importance of a node within that group. There are many measures of centrality that could serve as parameters to the algorithm, namely degree, betweenness, closeness and eigenvector centrality [5]. Through the use of more elaborate measures that consider the ensemble of all paths between two nodes (e.g. the Katz measure), they further improve the prediction quality. The graph features used for graph mining in this work are inspired by Liben-Nowell and Kleinberg's [5] work, but are adapted to probabilistic graphs. Taskar et al. [11] apply link prediction to a social network of universities; they rely on machine learning techniques and use personal information of users to increase prediction accuracy. Following a similar approach, O'Madadhain et al. focus on predicting events between entities and use geographic location as a feature. Clauset et al. [12] apply link prediction to biology and physics using hierarchical models in order to detect links that have not been observed during experimentation. All these approaches rely on the availability of an initial link structure for prediction. Van der Aalst et al. [13] extract a social network from logs of interactions between workers in a company. Similar works include mining email communications [14] and proximity interactions [15].
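To make the common-neighbors measure from the link prediction literature concrete, here is a minimal sketch in Python; the toy network and function name are illustrative, not taken from the surveyed papers.

```python
from itertools import combinations

def common_neighbor_scores(coauthors):
    """Score each pair of authors who have NOT yet collaborated by the
    number of collaborators they share; a higher score hints at a more
    likely future collaboration. `coauthors` maps each author to the set
    of authors they have published with (illustrative adjacency format)."""
    scores = {}
    for a, b in combinations(coauthors, 2):
        if b in coauthors[a]:
            continue  # already collaborators; nothing to predict
        shared = coauthors[a] & coauthors[b]
        if shared:
            scores[(a, b)] = len(shared)
    return scores

net = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"carol"},
}
print(common_neighbor_scores(net))  # both alice-dave and bob-dave share carol
```

More elaborate measures such as Katz weight all paths between the two authors rather than only length-two paths, at higher computational cost.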
  14. 2.2 Team Formation. There is a considerable amount of literature on team formation in the operations research community [16, 17, 18]. A trend in this line of work is to formulate the team formation problem as an integer linear program, and then focus on finding an optimal match between people and the demanded functional requirements. The problem is often solved using techniques such as simulated annealing, branch-and-cut or genetic algorithms [19, 18]. Lappas et al. [20] introduce the problem of team formation in social networks, where the objective is to minimize the coordination cost, e.g. in terms of the diameter or the weight of the minimum spanning tree of the team. This problem has been extended to cases in which potentially more than one member possessing each skill is required, and where density-based measures are used as objectives [21, 22]. It has also been extended to allow partial coverage of the required skills, introducing a multi-objective optimization problem that is optimized using simulated annealing [23]. More recently, Kargar and An [24] consider a variation with a different cost model: when a user who participates in a task contributes several of his skills, the contribution to the cost is independent for each skill.
  15. 3 Problem Formalization

     3.1 Researcher's Productivity. Increasingly, scientific research and development is looked upon to drive economic growth. With public expenditure on R&D making up an increasing proportion of GDP, governments and research funding bodies everywhere want to maximize the return on their investment. The productivity of a researcher is a time-variant quantity which depends on various factors. The two most important factors are publication volume and publication diversity, i.e. how many different collaborators a researcher has. To quantify productivity, the product of these two quantities is proposed here as an intuitive measure. At any given time t, the productivity P^t(a_i) of an author a_i is defined as the product of his publication volume \Gamma^t_{pub}(a_i) and collaboration volume \Gamma^t_{collab}(a_i):

     P^t(a_i) = \Gamma^t_{pub}(a_i) \cdot \Gamma^t_{collab}(a_i)    (3.1)

     3.2 Productive Team Formation. For a new research lab, a team T^* of size S \in \mathbb{N} is to be formed out of an available pool T of researchers. A fixed budget B \in \mathbb{R} is allocated in order to hire the researchers. Each researcher a_i is associated with a hiring cost C^t(a_i) at time t. The goal is to select T^* in such a way that, within the constraints of budget B and size S, the accumulated productivity at time t + w is maximized:

     T^* = \operatorname{argmax}_T \sum_{a_i \in T} P^{t+w}(a_i) \cdot x_i, \quad x_i \in \{0, 1\},\ w \ge 0    (3.2)

     subject to

     \sum_{a_i \in T} C^t(a_i) \cdot x_i \le B, \quad B \in \mathbb{R}    (3.3)

     \|T^*\| \le S, \quad S \in \mathbb{N}    (3.4)
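A minimal sketch of the productivity measure in Eq. (3.1), assuming each publication is represented by its set of co-authors; this input format is illustrative, not the report's actual data structure.

```python
def productivity(papers):
    """P^t(a_i) from Eq. (3.1): publication volume times the number of
    distinct collaborators. `papers` lists, for each of the researcher's
    publications up to time t, the set of co-authors."""
    volume = len(papers)                      # Gamma^t_pub(a_i)
    collaborators = set().union(*papers)      # union of all co-author sets
    return volume * len(collaborators)        # Gamma^t_pub * Gamma^t_collab

# 3 publications involving 3 distinct collaborators overall:
print(productivity([{"bob", "carol"}, {"carol"}, {"dave"}]))  # 9
```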
  16. This problem is exactly a 0-1 knapsack problem, except that the dimension of time is added: the non-zero weight, i.e. the productivity per unit cost, is the predicted weight at time t + w of each researcher rather than the weight at the current time t. The prediction window w \ge 0 is the time span after which the evaluation of the newly established research lab is carried out. To measure the performance of the selected team, the baseline is to select a team using prediction window w = 0, i.e. using the current productivity at time t. The information about each researcher a_i at time t is available in the dataset.
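The budgeted, size-constrained selection can be sketched as a 0-1 knapsack dynamic program. This is an illustrative implementation, not the exact heuristic used in the report; it assumes integer hiring costs to keep the table small, and the productivity values stand in for the predicted P^{t+w}.

```python
def select_team(researchers, budget, max_size):
    """Maximize summed predicted productivity subject to total hiring
    cost <= budget and team size <= max_size (Eqs. 3.2-3.4).
    `researchers` is a list of (name, cost, predicted_productivity)."""
    # best[b][s] = (value, team) achievable with budget b and at most s members
    best = [[(0, ())] * (max_size + 1) for _ in range(budget + 1)]
    for name, cost, prod in researchers:
        # iterate budgets and sizes downward so each researcher is used once
        for b in range(budget, cost - 1, -1):
            for s in range(max_size, 0, -1):
                cand = best[b - cost][s - 1][0] + prod
                if cand > best[b][s][0]:
                    best[b][s] = (cand, best[b - cost][s - 1][1] + (name,))
    return best[budget][max_size]

pool = [("a", 3, 10), ("b", 4, 40), ("c", 5, 30), ("d", 6, 50)]
print(select_team(pool, 10, 2))  # -> (90, ('b', 'd'))
```

With budget 10 and team size 2, the pair b+d (cost 10, value 90) beats the cheaper pair b+c (cost 9, value 70), illustrating that the optimum is not found by greedy cost minimization.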
  17. 4 Experimental Dataset. The dataset selected for experimentation is co-authorship data in which information about scientific documents (e.g. in computer science) along with their authors is provided in semi-structured XML.

     4.1 Dataset Description. 4.1.1 Dataset Name. The DataBase systems and Logic Programming (DBLP) server provides bibliographic information on major computer science journals and proceedings. Initially it focused on databases and logic programming, but it has gradually expanded toward most other fields of computer science, so the initial acronym has lost its meaning. It is a bibliography server, not a document repository or delivery service. 4.1.2 Dataset Source. Until summer 2011, DBLP was produced at the computer science department of the University of Trier. It is now a joint project of Schloss Dagstuhl - Leibniz Center for Informatics and the University of Trier. 4.1.3 Dataset Purpose. The bibliography evolved from an early small experimental web server into a popular service for the computer science community. DBLP listed more than 2.1 million articles on computer science in November 2012. All important computer science journals are tracked, as are the proceedings papers of many conferences. It is mirrored at five sites across the internet [25]. In June 2009 the DBLP bibliography contained more than 1.2 million bibliographic records. For computer science researchers, it is a popular tool to trace the work of colleagues and to retrieve bibliographic details when composing reference lists for new papers. Ranking and profiling of persons, institutions, journals, or conferences is another, sometimes controversial, usage of DBLP. It has enabled offline browsing of more than 1,000,000 publications by about 600,000 authors.
  18. 4.1.4 Dataset Format. DBLP was started in 1993 as a pure HTML application. Later, essential parts were converted to XML, but HTML style still remains the input language for data entry. The bibliographic records are contained in a large XML file [26]. The sample snippet shows the arrangement of records inside the XML data:

     <?xml version="1.0" encoding="ISO-8859-1"?>
     <dblp>
       record 1
       ...
       record n
     </dblp>

     Each record provides complete information about a publication, its authors, type and publishing year. A record can be an article, in-proceedings, proceedings, book, in-collection, PhD thesis, master's thesis or URL. The detailed content of the bibliography and each of its elements is described in [25, 26].

     4.2 Dataset Statistics. The data statistics cover the period from 1936 to 2013. Due to incompleteness of the data at the starting and ending periods, both the starting (1936-1939) and ending (2010-2013) periods are discarded; the dataset used for experimentation is from 1940 to 2009. Below are the snapshot statistics at 2009, which is considered the evaluation year for performance.

     General Statistics
       Total number of entries: 1,158,648
       Total number of authors: 696,360
     Journal Statistics
       Total number of journals indexed: 764
       Total number of journal articles: 444,691
     Conference Statistics
       Total number of conferences indexed: 4,371
       Total number of conference papers: 712,531
     Book Statistics
       Total number of books: 1,426

     Some other interesting trends in the dataset are shown below.
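Records in this layout can be read with a standard XML parser. Here is a minimal sketch using Python's standard library; the sample fragment and field choices are illustrative, and real DBLP dumps are ISO-8859-1 encoded and ship with a DTD for special-character entities, which this sketch omits.

```python
import xml.etree.ElementTree as ET

# A minimal DBLP-style fragment for illustration only.
SAMPLE = """<dblp>
  <article key="journals/x/Doe09">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>Example Paper</title>
    <year>2009</year>
  </article>
  <inproceedings key="conf/y/Roe08">
    <author>John Roe</author>
    <title>Another Paper</title>
    <year>2008</year>
  </inproceedings>
</dblp>"""

def parse_records(xml_text, kept_types=("article", "inproceedings",
                                        "incollection", "book")):
    """Yield (type, year, authors) for each record of a kept type."""
    for rec in ET.fromstring(xml_text):
        if rec.tag in kept_types:
            authors = [a.text for a in rec.findall("author")]
            year = int(rec.findtext("year"))
            yield rec.tag, year, authors

for rec in parse_records(SAMPLE):
    print(rec)
```

For the full multi-gigabyte dump, an incremental parser (e.g. `ET.iterparse`) would be used instead of loading the whole file into memory.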
  19. 4.2.1 Publication dates. The distribution clearly shows an exponential increase in the number of publications over time. Figure 4.1: Publication dates. 4.2.2 Publication volume by type. Figure 4.2 depicts the share of the different publication types in the bibliography. Some types occur very rarely in the database; the majority of the share belongs to in-proceedings and articles. Figure 4.2: Publication volume by type. 4.2.3 Average authors per paper. Figure 4.3 shows the average number of authors per paper after 1980, the start of the publication boom in computer science research.
  20. Figure 4.3: Average authors per paper. 4.2.4 Publications per Year. In Figure 4.4, the publications are grouped by publication type and publication year; the diagram shows the number of publications of each type per year. It is evident that after 1985 the number of publications per year started to rise exponentially, which may be due to the dot-com bubble. Figure 4.4: Publications per Year.
  21. 4.2.5 Authors per publication. It is evident from Figure 4.5 that only a few authors have a large number of publications, indicating their high impact on the current community, which also reflects an author's productivity. A large number of authors have fewer than 3 publications, which also indicates the sparsity of the network. Figure 4.5: Authors per publication. 4.2.6 Publications per author. The publications-per-author plot in Figure 4.6 reveals that few persons have a high number of publications, indicating a high publication throughput, which is a measure of productivity within the community. Figure 4.6: Publications per author.
  22. 4.2.7 Records in DBLP (grouped by year). The records are grouped by year and shown in Figure 4.7. An exponential increase has been recorded since 1980. Figure 4.7: Records in DBLP (grouped by year).
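The experimental setup described in the next chapter turns these records into incremental yearly snapshots of the author-publication graph. A minimal sketch of that construction, with an assumed (pub_id, year, authors) record format (the names and format are illustrative, not the report's code):

```python
from collections import defaultdict

def yearly_snapshots(records, start, end):
    """Build incremental author-publication bipartite snapshots: the
    snapshot for year y contains every record published in or before y.
    Each snapshot is a list of (author, pub_id) edges."""
    by_year = defaultdict(list)
    for pub_id, year, authors in records:
        if start <= year <= end:  # discard out-of-range years
            by_year[year].append((pub_id, authors))
    snapshots, edges = {}, []
    for y in range(start, end + 1):
        for pub_id, authors in by_year[y]:
            edges.extend((author, pub_id) for author in authors)
        snapshots[y] = list(edges)  # cumulative edge list up to year y
    return snapshots

recs = [("p1", 2000, ["a", "b"]), ("p2", 2001, ["b"])]
snaps = yearly_snapshots(recs, 2000, 2001)
print(len(snaps[2000]), len(snaps[2001]))  # 2 3
```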
  23. 5 Experimental Setup. Many experiments were carried out on the co-authorship dataset use case. This chapter focuses on the experimental setup, the scientific approach and the configurations used to solve the problem.

     5.1 Layered Approach. A simple layered approach is adopted for solving the modular problem of team formation. All the modules are independent, except that the core machine learning module provides the basis for effective team formation later on. The layers are: 1. Dataset Preprocessing, 2. Features Selection, 3. Machine Learning, 4. Team Formation.

     5.1.1 Dataset Preprocessing. Due to incompleteness and high skewness of the dataset at the starting and ending periods, both the starting (1936-1939) and ending (2010-2013) periods are discarded; the dataset used for experimentation is from 1940 to 2009. Only articles, in-proceedings, in-collections and books are considered, which contribute more than 97% of the data [25]. The reason to discard other publication types is to maintain uniformity when mapping external bibliographic information onto collaboration graphs, since other record types have different attributes than the four types considered. The dataset goes through two sequential processing steps. The first step is a bibliographic parser and filter, which reads the dataset and filters out irrelevant information; it also ignores any researcher with fewer than 2 publications or fewer than 2 collaborations. The second unit makes yearly snapshots of the bipartite graph from the co-authorship bibliography, as shown in Figure 5.1. 5.1.1.1 Bipartite Graph. A bipartite co-authorship graph, an example of a social network, is derived from the bibliography as shown in Figure 5.2. Each publication has at least one author and each author has at least one publication in the bipartite graph.

  24. Figure 5.1: Data processing pipeline. Any publication with only a single author will remain an unreachable node and will eventually be discarded during feature computation. Figure 5.2: Bipartite graph from bibliography. Yearly incremental snapshots of the co-author bipartite graph are created using the bibliographic information. Each year's snapshot contains all the publications and authors which appear in that year or before, which makes it incremental in nature. This graph topology is further used for analysis and for the computation of various graph-based features used for the predictions explained in the next chapter. 5.1.1.2 Bipartite Graph Generator. The graph generator reads the bibliographic data and generates a series of author-publication snapshots for each year. Figure 5.3 shows the prediction target year t + w and the present time t. The time dimension is also sliced into training, validation and testing data.

     5.2 Features Selection. The following linear pipeline (Figure 5.4) shows the network features selected and merged with network-exogenous features to form combined features, used for prediction and better overall performance.
Figure 5.3: Bipartite graph with yearly snapshots

Figure 5.4: Features selection and merge

5.2.1 Cost Estimation Model

Empirical data show that the salary of a researcher usually increases logarithmically with his number of publications and research experience. Although there are other, hard-to-quantify factors such as demographics and research domain, a simple cost model is used to simplify the problem. This cost is an estimate, since accurate information about a researcher's salary cannot be obtained without a detailed survey, which would be out of scope for this work.

5.2.2 Link Prediction Problem

Different graph topology features from the literature [5, 4, 27, 28, 29, 30] are used during experimentation. Most of these features are adapted to the particular needs of the co-authorship bipartite graph. Each feature is used for both publications and collaborations; e.g. common collaborators and common publications between two researchers are different quantities. The most important features are discussed in this report. A feature extractor is used to extract features of the
authors/researchers, in order to measure their proximity to other authors and their centrality [5]. To estimate each author's future productivity, both node-level and relational features are computed from the network topology. Relational features are eventually aggregated into node-level features; the average is used as the aggregation function during experimentation.

5.2.2.1 Bibliographic Features

Some features built from the auxiliary information about each researcher, independent of the network topology, are:

Career Time: the number of years since the author published his first publication.

Last Rest Time: the time since the author last published a paper.

Publication Interval: the average interval between consecutive publications of the author.

Publication Rate: the number of publications per year at which the author is publishing and contributing to the community.

5.2.2.2 Network Topological Features

The following network features are used, from the perspective of both collaborations and publications. The network topological features are aggregated to node level so that future collaborations and publications, which quantify the productivity of the node, can be predicted.
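The per-researcher quantities of Sections 5.2.1 and 5.2.2.1 can be sketched as simple functions of an author's publication years. The cost-model constants below are illustrative placeholders, not values fitted in this work.

```python
import math

def bibliographic_features(pub_years, current_year):
    """Career time, last rest time, publication interval and publication
    rate, computed from the years of an author's publications."""
    years = sorted(pub_years)
    career_time = current_year - years[0]
    last_rest_time = current_year - years[-1]
    gaps = [b - a for a, b in zip(years, years[1:])]
    publication_interval = sum(gaps) / len(gaps) if gaps else 0.0
    publication_rate = len(years) / max(career_time, 1)
    return career_time, last_rest_time, publication_interval, publication_rate

def estimated_cost(n_publications, career_time,
                   base=40_000.0, k_pub=8_000.0, k_exp=5_000.0):
    """Hypothetical logarithmic cost model (Section 5.2.1): salary grows
    with the log of publication count and experience, flattening for
    senior researchers. base, k_pub and k_exp are illustrative."""
    return (base + k_pub * math.log1p(n_publications)
                 + k_exp * math.log1p(career_time))
```

The log1p form gives diminishing returns: the jump from 0 to 10 publications raises the estimated cost far more than the jump from 10 to 20, matching the logarithmic growth described above.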
Common Neighbors: the most direct implementation of this idea for link prediction defines score(x, y) := |Γ(x) ∩ Γ(y)|, the number of neighbors that x and y have in common [30, 3]. It has been computed in the context of collaboration networks, verifying a correlation between the number of common neighbors of x and y at time t and the probability that they will collaborate in the future.

Jaccard Coefficient: a commonly used similarity metric in information retrieval [30]. It measures the probability that both x and y have a feature f, for a randomly selected feature f that either x or y has. Taking "features" here to be neighbors in G_collab, this leads to the measure score(x, y) := |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|.

Clustering Coefficient: a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties; this likelihood tends to be greater than the average probability of a tie randomly established between two nodes. The local clustering coefficient of a node quantifies how close its neighbors are to being a clique.

Adamic/Adar: a related measure, proposed in the context of deciding when two nodes are strongly related [4].

Preferential Attachment: has received considerable attention as a model of the growth of networks. The basic premise is that the probability that a new edge involves node x is proportional to |Γ(x)|, the current number of neighbors of x. [30] further proposed, on the basis of empirical evidence, that the probability of co-authorship of x and y is correlated with the product of the numbers of collaborators of x and y. This corresponds to the measure score(x, y) := |Γ(x)| · |Γ(y)|.

Katz: defines a measure that directly sums over the collection of paths between x and y, exponentially damped by length so that short paths count more heavily [5].

Rooted PageRank: random resets form the basis of the PageRank measure [31] for Web pages, and it can be adapted for link prediction as follows. Define score(x, y) under the rooted PageRank measure to be the stationary probability of y in a random walk that returns to x with probability α at each step, moving to a random neighbor with probability 1 − α. During experimentation, α was fine-tuned to improve results.

Hitting Time: one difficulty of hitting time as a measure of proximity is that it is quite small whenever y is a node with a large stationary probability, regardless of the identity of x. To counterbalance this phenomenon, normalized versions of the hitting and commute times are also considered.
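As an illustration, the neighbor-based scores above can be computed on the author projection of the bipartite graph. This is a minimal stdlib sketch under my own naming, not the feature extractor used in this work.

```python
import math

def author_neighbors(edges):
    """Project the author-publication bipartite graph onto authors:
    two authors are neighbors if they share at least one publication.
    edges: iterable of (author, publication) pairs."""
    by_pub = {}
    for a, p in edges:
        by_pub.setdefault(p, set()).add(a)
    nbrs = {}
    for authors in by_pub.values():
        for a in authors:
            nbrs.setdefault(a, set()).update(authors - {a})
    return nbrs

def common_neighbors(nbrs, x, y):
    """score(x, y) = |Γ(x) ∩ Γ(y)|"""
    return len(nbrs[x] & nbrs[y])

def jaccard(nbrs, x, y):
    """score(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|"""
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0

def adamic_adar(nbrs, x, y):
    """Common neighbors weighted by the rarity of each shared neighbor."""
    return sum(1.0 / math.log(len(nbrs[z]))
               for z in nbrs[x] & nbrs[y] if len(nbrs[z]) > 1)

def preferential_attachment(nbrs, x, y):
    """score(x, y) = |Γ(x)| · |Γ(y)|"""
    return len(nbrs[x]) * len(nbrs[y])
```

Aggregating such pairwise scores over a node's neighborhood (here, by averaging, as in Section 5.2.2) turns these relational features into the node-level features fed to the regression models.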
5.3 Supervised Machine Learning

Machine learning is the core task of this research work. The machine learning models provide the basis for classical team formation algorithms, which are well studied in the domain of operations research. Publication and collaboration volume prediction is essentially similar to link or degree prediction in graphs. There are two learning models, both based on regression, which take features of the network as input and predict, respectively, the volume of publications and the volume of collaborations. As discussed earlier (equation 3.1), productivity is described in terms of the volumes of publications and collaborations. The collaboration volume of an author indicates how many different authors in the community the author has collaborated with overall. Figure 5.5 shows the training of the regression models; their performance is evaluated afterwards.

Figure 5.5: Machine Learning Architecture

5.3.1 Train, Validation and Test Sets

One important task is to divide the dataset into training, validation and testing sets, since a time dimension is associated with every piece of information. Figure 1.3 shows the slices between training/validation and testing data. It also depicts the present time and the future time that is the target of prediction. Data skewness increases exponentially with time in this dataset.
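The temporal slicing above can be sketched as follows. The exact split boundaries used in this work are those of Figure 1.3; the policy below (features at year y predict targets at y + w, with the latest usable years reserved for validation and testing) is an illustrative assumption.

```python
def temporal_split(snapshot_years, t, w):
    """Split yearly snapshots by time: features computed at year y predict
    targets at y + w, so training uses years whose target falls strictly
    before the present time t, validation uses the year whose target is t,
    and testing uses the present year t (target at the future time t + w)."""
    train = [y for y in snapshot_years if y + w < t]
    validate = [y for y in snapshot_years if y + w == t]
    test = [y for y in snapshot_years if y == t]
    return train, validate, test
```

Because skewness grows toward the most recent years, keeping the test slice at the end of the timeline mirrors the real deployment setting: the model always extrapolates forward in time.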
5.3.2 Regression Problem

Productivity is defined in terms of collaboration and publication volume. Both are numerical quantities, which makes the learning problem a regression problem, since the target value is numeric. The models most relevant to this problem setting are briefly described below.

Linear Regression: data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the dependent variable y given the explanatory variables X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

M5P Decision Tree: M5Base implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by R. Quinlan [32], and Yong Wang [33] made improvements.

Multilayer Perceptron: a multilayer perceptron (MLP) is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP uses a supervised learning technique called back-propagation for training, is a modification of the standard linear perceptron, and can distinguish data that are not linearly separable.

REPTree: a fast decision tree learner. It builds a decision/regression tree using information gain/variance reduction and prunes it using reduced-error pruning (with backfitting). It sorts values for numeric attributes only once; missing values are dealt with by splitting the corresponding instances into pieces.

The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed.
These individual differences are called residuals when the calculations are performed over the data sample used for estimation, and prediction errors when computed out of sample. The RMSE aggregates the magnitudes of the prediction errors at various times into a single measure of predictive power. RMSE is a good measure of accuracy, but only for comparing the forecasting errors of different models on a particular variable, not between variables, since it is scale-dependent.

5.4 Team Formation

5.4.1 0-1 Knapsack Problem

The knapsack (or rucksack) problem is a problem in combinatorial optimization. Given a set of items, each with a mass and a value, determine the number of
each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items. The problem often arises in resource allocation under financial constraints and is studied in fields such as combinatorics, computer science, complexity theory, cryptography and applied mathematics. For the team selector, a 0-1 knapsack approach [16] is used, in which the team is built by selecting the authors who maximize predicted productivity while staying within the budget constraint B and the team size constraint ||T||. Equations 3.2 and 3.3 formalize the problem as a knapsack problem.
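A minimal sketch of the cardinality-constrained 0-1 knapsack behind the team selector follows: a dynamic program over (team size, spent budget) states. The candidate tuples are illustrative, and the state representation is my own, not the implementation used in the experiments.

```python
def select_team(candidates, budget, team_size):
    """0-1 knapsack with a cardinality constraint: choose at most
    team_size researchers whose total cost stays within budget,
    maximizing the sum of predicted productivity.
    candidates: list of (name, cost, predicted_productivity)."""
    # dp[(k, b)] -> (best total productivity, frozenset of chosen names)
    dp = {(0, 0): (0.0, frozenset())}
    for name, cost, prod in candidates:
        # Snapshot the states so each candidate is used at most once (0-1).
        for (k, b), (val, team) in list(dp.items()):
            if k < team_size and b + cost <= budget:
                nk, nb = k + 1, b + cost
                nv = val + prod
                if nv > dp.get((nk, nb), (-1.0, frozenset()))[0]:
                    dp[(nk, nb)] = (nv, team | {name})
    return max(dp.values(), key=lambda entry: entry[0])
```

For example, with budget 9 and team size 2 over candidates ("a", 3, 10), ("b", 4, 12), ("c", 5, 20), ("d", 2, 3), the optimum is the pair {b, c} with total productivity 32 at total cost 9.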
6 Experimental Results & Conclusion

The regression models, trained and optimized over the training and validation sets respectively, are used to predict the publications and collaborations of the researchers in the test dataset. These two quantities constitute the overall productivity of the researcher, as in equation (3.1).

6.1 Machine Learning Performance

6.1.1 Regression Models

The regression models trained and validated earlier are used to predict the collaboration and publication volumes on the testing dataset for each researcher. The prediction window w is empirically set to 2 and 4 years. Due to high skewness in the data during the last years, all the regression models deteriorate in performance as the prediction window size w grows, so the prediction window is critical for model performance. The measure used to evaluate the residuals of the regression models is the root mean squared error (RMSE). The performance of the various regression models for both publication and collaboration volumes is given in the tables below.

Prediction window w = 4, error metric RMSE:

Model                       Γ^{t+4}_pub(a_i)   Γ^{t+4}_collab(a_i)
Simple Linear Regression    3.33               2.87
Least Median Square         4.25               3.24
Multilayer Perceptron       3.54               3.08
Non Linear Regression       3.36               2.89
Decision Stump              4.29               3.32
M5P                         3.14               2.80
Isotonic Regression         3.68               3.09
REPTree                     3.34               2.93

For a prediction window w = 4, the M5P decision tree performs best.
Prediction window w = 2, error metric RMSE:

Model                       Γ^{t+2}_pub(a_i)   Γ^{t+2}_collab(a_i)
Simple Linear Regression    1.73               1.68
Least Median Square         2.47               1.85
Multilayer Perceptron       1.89               1.82
Non Linear Regression       1.74               1.69
Decision Stump              2.78               2.17
M5P                         1.65               1.67
Isotonic Regression         1.89               1.75
REPTree                     1.74               1.73

For a prediction window w = 2, the M5P decision tree again performs best.

6.1.2 Baseline for Performance

The baseline proposed here for judging the prediction models is the current productivity, i.e. the current average numbers of collaborations and publications at the present time t, compared against the predicted collaborations and publications, i.e. the predicted productivity.

6.1.3 Results Evaluation

The curves in Figure 6.1 show that the prediction model (red) normally exceeds the baseline (blue) and proves better than the baseline. The models are less accurate relative to the actual productivity (yellow) because of network-exogenous factors (e.g. research environment and funding) that may influence researcher productivity. The performance of the models can be improved by incorporating more network-exogenous information alongside the network topological features.

6.2 Team Selection Performance

The team selection strategy is based on the 0-1 knapsack algorithm described in chapter 5. During the team selection process it is assumed that any of the researchers can be hired, although in reality this is often impossible. The prediction window is set to 2 years, using the M5P regression model, which was optimal and gave the closest results on the training and validation datasets. Teams are selected based on baseline productivity knowledge, on productivity predictions from the regression model, and finally on actual productivity using the future graph.
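The baseline comparison can be sketched as follows. The RMSE function matches the definition in Section 5.3.2; the sample values are purely illustrative and are not taken from the experiments above.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between paired predictions and observations."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Baseline predictor: carry each researcher's current productivity at
# time t forward to t + w; a useful model must beat this carry-forward.
actual    = [5.0, 2.0, 7.0]   # illustrative future volumes
baseline  = [4.0, 1.0, 3.0]   # current volumes carried forward
predicted = [5.5, 2.5, 6.0]   # illustrative model output
assert rmse(predicted, actual) < rmse(baseline, actual)
```

Because RMSE is scale-dependent, this comparison is only meaningful within one target variable (publication volume or collaboration volume), never across the two.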
Figure 6.1: Prospect member's productivity (Baseline, Predicted, Actual) for

6.2.1 Baseline for Performance

The performance of the team selection strategy is judged against an oracle that has access to the future network. Because the oracle can see the future graph, it can measure the actual performance of the selected team. To measure the performance of the team selection strategy, the latest available time is included in the test set; the test set contains the actual information that is modeled by the machine learning models and the team selection algorithm, as described in chapter 5, subsection 5.3.1.

6.2.2 Results Evaluation

Figure 6.2 shows that the machine learning approach performs better than the baseline approach, but remains less accurate than the maximum productivity obtainable from the same pool of researchers in the community. The reason for this gap turns out to be mostly the prediction error of the models; the high skewness of publications and collaborations plays an important role in the lower performance of the model.
Figure 6.2: Team's accumulative productivity with baseline, regression models and oracle
7 Future Work

7.1 Scientific Contributions

The major scientific contributions include:

• Proposing a researcher productivity metric for co-authorship data.
• Proposing a two-phase team formation strategy based on regression models.
• Adaptation of various graph-theoretic features to co-authorship graphs.
• Merging various network-exogenous bibliographic features with the network topology so they can be used simultaneously.
• Adaptation of a random walk algorithm to the co-authorship bipartite graph.
• Proposing a cost model to estimate researcher salary based on empirical data.

7.2 Problem Extensions

A future extension of this problem may add other interesting constraints. One such case is the formation of a team with multiple, non-overlapping and complementary competencies using a multi-modal network, e.g. one spanning computer science and physics. The challenging part will be the tradeoff between specialty and generality.

Another possible extension is to consider weighted collaboration, since collaboration with a senior scientist is relatively more important than collaboration with a junior scientist. Similarly, a publication accepted in a high impact journal or conference carries more weight than one in a lower impact venue.

Apart from co-authorship networks, another possible use case of this problem is Yahoo! Answers. It may be desired to form a community of the most active and knowledgeable users available for hiring. Productivity in such a scenario can be defined as the ratio between the number of best accepted answers and the total number of answers. The network topology remains the same; only some network-exogenous features need to be redefined.
7.3 Technical Achievements

During the internship period I made several technical advancements that will prove beneficial. First, I acquainted myself with a state-of-the-art research environment, where I learned about the latest research work going on in the industry and a forward-looking approach to challenging problems. I also worked with Yahoo's state-of-the-art computing infrastructure during my internship. To work with big sparse graphs, I had the opportunity to use the Neo4j graph framework, which is designed for large sparse graphs. I gained practical experience working with web scrapers and processing big graph data, and performed all the development work required for the experiments. This work also made extensive use of the Java and R programming languages.
Bibliography

[1] S. Wasserman and K. Faust, Social Network Analysis. Cambridge: Cambridge University Press, 1994.
[2] Y. Han, B. Zhou, J. Pei, and Y. Jia, "Understanding importance of collaborations in co-authorship networks: A supportiveness analysis approach."
[3] M. E. J. Newman, "The structure of scientific collaboration networks," 2001.
[4] L. A. Adamic and E. Adar, "Friends and neighbors on the web."
[5] D. Liben-Nowell, "The link prediction problem for social networks."
[6] R. V. et al., "Social networks to biological networks: systems biology of Mycobacterium tuberculosis," 2010.
[7] B. A. Pescosolido, "The sociology of social networks," Indiana University.
[8] J. Berger, "What makes online content viral?"
[9] T. Opsahl, "Node centrality in weighted networks: Generalizing degree and shortest paths," 2010.
[10] R. R. Khorasgani, "Top leaders community detection approach in information networks."
[11] B. T. et al., "Link prediction in relational data," Neural Information Processing Systems, vol. 15, 2003.
[12] A. C. et al., "Hierarchical structure and the prediction of missing links in networks," Nature, 2008.
[13] W. V. D. A. et al., "Discovering social networks from event logs," Computer Supported Cooperative Work, 2005.
[14]
[15] N. Eagle and A. Pentland, "Reality mining: Sensing complex social systems," Personal and Ubiquitous Computing, 2006.
[16] M. Fokkinga, "A greedy algorithm for team formation that is fair over time."
[17] C. Delort, O. Spanjaard, and P. Weng, "Committee selection with a weight constraint based on a pairwise dominance relation."
[18] S. Omkar, "Cricket team selection using genetic algorithm."
[19] E. D. Goodman, "Introduction to genetic algorithms," tech. rep.
[20] T. Lappas, K. Liu, and E. Terzi, "Finding a team of experts in social networks."
[21] A. Gajewar and A. D. Sarma, "Multi-skill collaborative teams based on densest subgraphs," SDM, 2012.
[22] C.-T. Li and M.-K. Shan, "Team formation for generalized tasks in expertise social networks," SocialCom, 2010.
[23] C. Dorn and S. Dustdar, "Composing near-optimal expert teams: A trade-off between skills and connectivity," CoopIS, 2010.
[24] M. Kargar and A. An, "Discovering top-k teams of experts with/without a leader in social networks," CIKM, 2011.
[25] M. Ley, "DBLP: Some lessons learned."
[26] M. Ley, "DBLP XML requests."
[27] F. Bonchi, "Influence propagation in social networks: A data mining perspective."
[28] B. B. Cambazoglu, "Cold start link prediction."
[29] A. Goyal, "Learning influence probabilities in social networks."
[30] M. E. J. Newman, "Scientific collaboration networks. I. Network construction and fundamental results."
[31] L. Page and S. Brin, "The PageRank citation ranking: Bringing order to the web."
[32] R. J. Quinlan, "Learning with continuous classes," in 5th Australian Joint Conference on Artificial Intelligence, (Singapore), pp. 343-348, World Scientific, 1992.
[33] Y. Wang and I. H. Witten, "Induction of model trees for predicting continuous classes," in Poster Papers of the 9th European Conference on Machine Learning, Springer, 1997.
[34] M. E. J. Newman, "Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality."
[35]
List of Figures

4.1 Publication dates 17
4.2 Publication volume by type 17
4.3 Average authors per paper 18
4.4 Publications per Year 18
4.5 Authors per publication 19
4.6 Publication per author 19
4.7 Records in DBLP (grouped by year) 20
5.1 Data processing pipeline 22
5.2 Bipartite Graph from bibliography 22
5.3 Bipartite graph with yearly snapshots 23
5.4 Features selection and merge 23
5.5 Machine Learning Architecture 26
6.1 Prospect member's productivity (Baseline, Predicted, Actual) for 31
6.2 Team's accumulative productivity with baseline, regression models and oracle 32
