Life-Cycles and Mutual Effects of Scientific Communities: ASNA 2010
 


    • Life-Cycles and Mutual Effects of Scientific Communities
      Václav Belák, Marcel Karnstedt, Conor Hayes
      Digital Enterprise Research Institute, NUI Galway
      ASNA 2010, Zürich
    • Introduction Methodology Data-Set Results Conclusion and FW
    • Motivation
      • Progress in science is often measured by citation measures, which are relatively static
      • Detecting and explaining the evolution and life-cycles of communities provides better arguments for progress
      • Previous approaches focused mainly on analysing co-citation graphs or textual clustering
      • Little work exists on the analysis of cross-community effects
      • Kuhn [5] claimed that the development of scientific knowledge proceeds in discrete steps:
        1. Pre-paradigm period
        2. Paradigm period (normal science)
        3. Crisis
        4. Reaction to the crisis (paradigm shift)
    • Cross-Community Effects I: Expected Phenomena
      [Figure with two panels: (a) Community shift; (b) Community merge (with community shift). Labels in the figure: Clique, Graph & Network Analysis, Cluster, Paradigm shift, Paradigm merge.]
    • Cross-Community Effects II
      • Although inspired by Kuhn, we expected the evolution of communities to take a rather milder form
      • Instead of paradigm shift, we were looking for community shift
      • Community merge is a complementary phenomenon, but a rather uninteresting one
      • We therefore investigated combinations of shifts with subsequent merges, i.e. community shift/merges
      • Instead of paradigm articulation, we were looking for community specialization
      • Co-citation networks of two big camps in CS were analysed: Semantic Web (solution-driven) and Information Retrieval (problem-driven) [1]
    • Outline
      1. Methodology
      2. Data-Sets
      3. Results
      4. Conclusion and Future Work
    • Initial Expectations & Requirements
      The methodology was developed with a set of requirements arising from the nature of the problem:
        1. A dynamic data-set represented by snapshots of several consecutive time-steps
        2. Communities have to be identified in the network in each time-step
        3. Authors (nodes in general) have to be uniquely identified across all time-steps
        4. For topical analysis, meta-data (topics) describing the nodes are necessary
    • Community Detection
      • We identified communities using three popular algorithms:
        • Infomap [7]
        • Louvain [2]
        • WT [8]
      • All have publicly available implementations, are able to operate over weighted networks, and produce non-overlapping communities
      • In each time-step $t$, we identified a clustering $C^t$ of $n$ communities: $C^t = \{c_1^t, c_2^t, \ldots, c_n^t\}$, where $n$ is determined automatically for each time-step
    • Tracking of Dynamic Communities
      • Communities are identified independently for each time-step. It is thus necessary to track the evolution of each community in further time-steps
      • Communities were matched according to the highest Jaccard coefficient:
        $\mathrm{match}(c_i^t) = \arg\max_{c_j^{t+1} \in C^{t+1}} \frac{|c_i^t \cap c_j^{t+1}|}{|c_i^t \cup c_j^{t+1}|}$
      • Important ancestors and descendants were identified by a modified Jaccard coefficient:
        $\mathrm{ancestor}(c_i^t, c_j^{t+1}) = \frac{|c_i^t \cap c_j^{t+1}|}{|c_j^{t+1}|}$, $\mathrm{descendant}(c_i^t, c_j^{t+1}) = \frac{|c_i^t \cap c_j^{t+1}|}{|c_i^t|}$
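
      The matching and the ancestor/descendant relations on this slide can be sketched on plain Python sets. This is a minimal illustration, not the authors' implementation: communities are assumed to be sets of author ids, a clustering is assumed to be a dict from a community id to such a set, and the author names and community ids below are made up.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two author sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(community, next_clustering):
    """Id of the community at t+1 with the highest Jaccard overlap."""
    return max(next_clustering,
               key=lambda j: jaccard(community, next_clustering[j]))


def ancestor(c_t, c_next):
    """Share of the later community's members that came from the earlier one."""
    return len(c_t & c_next) / len(c_next) if c_next else 0.0


def descendant(c_t, c_next):
    """Share of the earlier community's members that moved to the later one."""
    return len(c_t & c_next) / len(c_t) if c_t else 0.0


# Hypothetical toy clusterings for two consecutive time-steps.
clustering_t = {"c1": {"ann", "bob", "cat", "dan"}, "c2": {"eve", "fay"}}
clustering_t1 = {"c1": {"ann", "bob", "gus"},
                 "c3": {"cat", "dan", "eve", "fay"}}

best = match(clustering_t["c1"], clustering_t1)
share_of_c3_from_c2 = ancestor(clustering_t["c2"], clustering_t1["c3"])
share_of_c2_in_c3 = descendant(clustering_t["c2"], clustering_t1["c3"])
```

      In this toy case the whole of c2 ends up in c3 (descendant of 1.0) while making up only half of it (ancestor of 0.5), which is the asymmetry the two modified coefficients are meant to capture.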
    • Visualization
      • To compare and inspect the state of the network in different time-steps, a proper visualization is very helpful
      • Nodes that appeared previously should have similar positions
      • Colours denoting the affiliation of a node to its cluster should be preserved
      • As we did not find any existing tool implementing these requirements, we built our own based on JUNG
      • Another tool based on Graphviz was built to automatically create diagrams of ancestors and descendants based on the respective relations
    • Topic Detection I
      • We mined keywords using NLP techniques [3] from the abstracts or full-texts of almost 70% of the underlying articles
      • Tokenised and stemmed [6] keywords were then assigned to each author
      • The ability of keywords to discriminate authors was ranked according to their frequency (TF) and uniqueness in the corpus (IAF): TF-IAF
      • Each author $a$ in time-step $t$ was thus described by a bag-of-words vector $k_a^t$
      • The topical description of a cluster $c$ was obtained as the centroid of its members
      • Cosine similarity was used to determine the topical similarity of two clusters
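
      The pipeline on this slide (author vectors, cluster centroids, cosine similarity) can be sketched as follows. The slides do not give the exact TF-IAF formula, so the logarithmic weighting below is an assumption borrowed from classic TF-IDF, with authors playing the role of documents; the author ids and stemmed keywords are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_iaf(author_keywords):
    """Weight keywords per author as TF * log(N / number of authors using it).

    The log form of IAF is an assumption modelled on TF-IDF.
    """
    n_authors = len(author_keywords)
    author_freq = Counter()
    for keywords in author_keywords.values():
        author_freq.update(set(keywords))
    return {
        author: {kw: tf * math.log(n_authors / author_freq[kw])
                 for kw, tf in Counter(keywords).items()}
        for author, keywords in author_keywords.items()
    }


def centroid(vectors):
    """Component-wise mean of sparse keyword vectors (a cluster's topic)."""
    total = defaultdict(float)
    for vec in vectors:
        for kw, weight in vec.items():
            total[kw] += weight
    return {kw: w / len(vectors) for kw, w in total.items()} if vectors else {}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(kw, 0.0) for kw, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical authors with stemmed keywords.
authors = {
    "a1": ["ontolog", "ontolog", "web"],
    "a2": ["ontolog", "rdf", "web"],
    "a3": ["rank", "queri", "web"],
    "a4": ["rank", "web"],
}
vectors = tf_iaf(authors)
sw_topic = centroid([vectors["a1"], vectors["a2"]])
ir_topic = centroid([vectors["a3"], vectors["a4"]])
similarity = cosine(sw_topic, ir_topic)
```

      Because every toy author uses "web", its weight is zero and the two clusters share no discriminating keyword, so their similarity is 0 here.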
    • Topic Detection II
      • Interpretation of a cluster's topic was based on characterizing keywords, a union of:
        • the 20 highest ranked keywords
        • the 20 most frequent keywords
      • We were particularly interested in cross-community activity between the IR and SW camps
      • Whether a community counted as IR- or SW-related was decided from frequent patterns mined from the publications
      • Any event detected by the community topic evolution measures and associated with both IR- and SW-related communities was then considered inter-camp dynamics
      • Meta-data was used to assess the quality of the clusterings; WT was omitted from further analysis
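
      The characterizing keywords of a cluster can be sketched as a set union. The sketch assumes that "highest ranked" refers to the TF-IAF weight of the cluster centroid and that frequency is a raw count; both inputs, the function name, and the toy values are hypothetical.

```python
from collections import Counter


def characterizing_keywords(weights, counts, k=20):
    """Union of the k highest-weighted and the k most frequent keywords."""
    top_ranked = sorted(weights, key=weights.get, reverse=True)[:k]
    top_frequent = [kw for kw, _ in Counter(counts).most_common(k)]
    return set(top_ranked) | set(top_frequent)


# Hypothetical cluster: discriminative keywords differ from frequent ones.
weights = {"sparql": 3.1, "ontolog": 2.4, "web": 0.1, "system": 0.05}
counts = {"web": 40, "system": 35, "ontolog": 12, "sparql": 3}

keywords_k2 = characterizing_keywords(weights, counts, k=2)
keywords_k1 = characterizing_keywords(weights, counts, k=1)
```

      Taking the union keeps both rare but discriminative terms and common terms that describe the cluster's general area.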
    • Measures
      • Overlap measures induce a huge number of interactions between communities
      • The solution is to apply more specific measures or to use the simple ones in combination
      • We developed and/or used two categories of measures:
        1. community life-cycle measures, for measuring and explaining the state and evolution of a community
        2. community topic evolution measures, for revealing cross-community phenomena like community shift
    • Community Life-Cycle Measures
      • Structural perspective:
        • size $S$
        • average vertex betweenness $B$, $R_B \in \mathbb{R}^+$
        • relative density $\rho$, $R_\rho \in [0, 1]$
        • author entropy $A$, $R_A \in [0, 1]$
      • Topical perspective:
        • topic drift $T$, $R_T \in [0, 1]$
        • cluster content ratio $H$, $R_H \in \mathbb{R}^+$
    • Community Topic Evolution Measures
      • We looked for parallel changes of structure and topic of communities
      • Structural and topical measures were combined by multiplication, for simplicity and because the range remains within [0, 1]
      • Community shift $P_S$ may be detected as the emergence of a new community topically distinct from its ancestor:
        $P_S(c_i^t, c_j^{t+1}) = \mathrm{dissim}(c_i^t, c_j^{t+1}) \times \mathrm{ancestor}(c_i^t, c_j^{t+1})$
    • Community Topic Evolution Measures II
      • Community shift/merge $P_{S/M}$ may be detected as a merge of two topically distinct communities:
        $P_{S/M}(c_i^t, c_j^{t+1}) = \mathrm{dissim}(c_i^t, c_j^{t+1}) \times \mathrm{descendant}(c_i^t, c_j^{t+1})$
      • Note that both $P_S$ and $P_{S/M}$ are defined only for two different communities, i.e. only if $i \neq j$
      • Community topic change $P_C$ expresses a change of topic of a structurally stable community:
        $P_C(c_i^t) = \mathrm{dissim}(c_i^t, c_i^{t+1}) \times (1 - A(c_i^{t+1}))$
      • Only events with values > 0.5 and with a minimal overlap of 10 authors were selected for deeper analysis
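
      The three community topic evolution measures and the selection rule can be sketched as plain products. The slides do not define dissim or author entropy in detail, so the functions below take them as precomputed inputs in [0, 1]; one natural choice, assumed here and not confirmed by the source, is dissim = 1 − cosine similarity of the cluster centroids. Function names and the toy values are hypothetical.

```python
def community_shift(dissim, ancestor):
    """P_S: topical dissimilarity times the ancestor share."""
    return dissim * ancestor


def community_shift_merge(dissim, descendant):
    """P_S/M: topical dissimilarity times the descendant share."""
    return dissim * descendant


def topic_change(dissim, author_entropy_next):
    """P_C: topical dissimilarity times (1 - author entropy at t+1)."""
    return dissim * (1.0 - author_entropy_next)


def is_selected(value, shared_authors, threshold=0.5, min_overlap=10):
    """Keep events above the threshold that share enough authors."""
    return value > threshold and shared_authors >= min_overlap


# Hypothetical event: 80% of the new community comes from one ancestor
# whose topic is fairly different (dissim = 0.8).
shift = community_shift(dissim=0.8, ancestor=0.8)
kept = is_selected(shift, shared_authors=12)
too_small = is_selected(shift, shared_authors=9)
```

      Since both factors lie in [0, 1], each product is bounded by its structural factor: a community shift can only exceed the 0.5 threshold if more than half of the new community comes from the ancestor in question.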
    • Data-Set
      • We first picked a set of major conferences in both fields
      • We then selected publications from these conferences from DBLP for 2000–2009
      • A co-citation network of 5772 authors and 817642 edges over all years was extracted
      • 3-year time-steps with 2-year overlap: 2000–2002, 2001–2003, ...
      • The total number of articles was 39314, for which we were able to scrape 22975 abstracts and 3740 full-texts
        • Nearly 70% coverage by content
      • We scraped 18313 author-provided keywords for 4102 distinct articles
        • Coverage by these high-quality meta-data was 10%
      • We mined 263742 keywords from abstracts and full-texts
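
      The overlapping time-steps and the two coverage figures can be reproduced with a few lines of arithmetic. The sketch assumes that the windows slide by one year until the last one ends in 2009, and that abstracts and full-texts belong to different articles (so their counts can be added); the function name is hypothetical.

```python
def sliding_windows(first_year, last_year, length=3):
    """Overlapping (start, end) windows that advance by one year."""
    return [(start, start + length - 1)
            for start in range(first_year, last_year - length + 2)]


windows = sliding_windows(2000, 2009)

# Coverage figures from the counts reported on the slide.
articles = 39314
content_coverage = (22975 + 3740) / articles   # abstracts + full-texts
keyword_coverage = 4102 / articles             # author-provided keywords
```

      Under these assumptions the content coverage comes out just under 68% and the author-keyword coverage just above 10%, in line with the "nearly 70%" and "10%" stated on the slide.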
    • Shift of Louvain Community 26
      • The emergence of Louvain community 26 was identified as an inter-camp community shift, $P_S \doteq 0.62$, in 2006
      • It was formed by 80% of community 6 "web IR" and by 20% of community 5 "SW"
      • Keywords in 2006 such as "navigation", "personalization", and "semantic web" suggest transdisciplinary topics
      • Massive influence of community 15 "SW and IR" in 2007 and a change of topic towards "SW and business processes"
        • Observed as a low topic drift, $T \doteq 0.29$
      • IR-related keywords appeared again among the characterizing keywords in 2008
        • The topic then stabilized: $T \doteq 0.65$
    • Evolution of Louvain Community 26
      Communities 6 "web information retrieval", 5 "semantic web", 15 "semantic web and information retrieval" and their descendant community 26
      [Figure: ancestor/descendant diagram with nodes c5, c6, c15, and c26. Time-step labels as extracted: 2005–2007, 2006–2008, 2007–2009, 2008–2009. Edge labels as extracted (percentages): 20, 80, 2.8, 4.7, 48.6, 51.4, 90.6, 8.3; their assignment to edges is not recoverable from the transcript.]
    • Position of Louvain Community 26 in 2006 and 2007
      [Figure: network layouts showing communities 6 "web information retrieval" (pink), 5 "semantic web" (red), 15 "semantic web and information retrieval" (violet) and their descendant community 26 (green)]
    • Specialization of Infomap Community 9
      • Initially oriented towards general and core SW-related topics in 2000
      • Between 2002 and 2004 we identified 3 shifts
      • One of these shifts was community 99 "semantic desktop and personalization"
      • The community itself then specialized in "SW services"
      • $S$, $T$, and $H$ provided valuable insights
      • $\rho$, $B$, and $A$ did not seem to provide any further insights
    • Life-Cycle Measures of Infomap Community 9
      [Figure: line plot of ρ, H, B, S, A, and T over time (2000–2008); one vertical axis for H, T, A, ρ (0 to 2) and one for B, S (0 to 4500)]
    • Life-Cycle Measures of Infomap Community 99
      [Figure: line plot of ρ, H, B, S, A, and T over time (2003–2008); one vertical axis for H, T, A, ρ (up to 1.6) and one for B, S (up to 1000)]
    • Shift/Merge of Community 86
      • We identified a shift/merge, $P_{S/M} \doteq 0.91$, of community 86 with community 0
      • Both communities were concerned with IR-related topics, but each had its specific theme:
        • 86 being more focused on "development", "engine", and "system"
        • 0 being more focused on "question answering"
      • 90.9% of authors from 86 moved to community 0
      • Relative density $\rho \doteq 0.47$ and a high cluster content ratio $H \doteq 1.91$ suggest it was topically coherent, but structurally weak
      • The suitability of any life-cycle measure cannot be generalized, as we identified only one shift/merge
    • Tag Clouds of Communities 86 and 0
      Characterising keywords per community:
      • 2002, c86: intuitive, development, ir, retrieval, control, implemented, describing, high-dimensional, reducing, engine, execution, advanced, information, system, multi-dimensional, image, usin, accurate, time, precise, features, queries, service, dataset, document, analysis, large, structure, cluster, and, web, processing
      • 2003, c0: resolution, evaluation, passages, architecture, question, qa, patterns, definitions, development, trec, mit, candidates, linguistic, retrieval, answering, system, analysis, javelin, modules, advanced, methods, science, information, approaches, processing, using, computer, language, techniques
    • Change of Topic of Infomap Community 54
      • An inter-camp community topic change, $P_C \doteq 0.58$, was identified for Infomap community 54 between 2005 and 2006
      • The topic changed from "knowledge management" and "information extraction" towards "knowledge querying" and "semantic web"
      • Zero author entropy $A$ suggests this might have been caused by new members joining the community:
        • 34.5% were completely new, i.e. they did not come from any previous community
        • 20.7% came from 54 "knowledge management and information extraction"
        • 17.2% came from 29 "ontologies and SW"
        • 6.9% came from 70 "ontologies and folksonomies"
        • 6.9% came from 112 "semantic web services"
    • Tag Clouds of Infomap Community 54
      Characterising keywords per community:
      • 2005, c54: organizational, kms, sw, capturing, environment, working, ie, acquisition, wikifactory, legacy, manager, goal, semantic, tool, cooperative, layers, healthier, defining, quantitative, knowledge, web, text, learning, techniques, computer, supporting, science, machine, documents, information, system
      • 2006, c54: ontologies, language, query, specification, knowledge, manager, semantic, pure, capturing, data, search, keyword, layers, keyword-based, hybrid, architecture, spreadsheet, web, ie, application, information, modelling, approach, algorithm, using, methodic, retrieval, service, system, structures
    • Emergence of Intermediary Louvain Community 15
      • The most complex scenario we investigated
      • It first emerged as a descendant of community 4 "IR" with the topic "cross-language IR", which was identified as a community shift, $P_S \doteq 0.55$, in 2003
      • From 2004, this community was under a massive influence of community 5 "SW", which caused a change towards SW-related topics, $P_C \doteq 0.31$
      • From 2005, IR-related keywords appeared again among its characterizing keywords, while those keywords disappeared in community 5
      • Therefore, whereas community 5 kept its focus on the core SW-related topics, it contributed substantially to the forming of a new interdisciplinary community
    • Betweenness of Louvain Community 15
      • Despite still being focused mainly on SW-related topics, community 15 worked as an intermediary between both camps
      • This hypothesis is supported by the high average author betweenness $B$

                          2004–2006             2007–2009
                          S       B             S       B
        c15               444     1591.01659    445     2535.02
        entire network    2776    2066.70764    2190    2192.85117
    • Position of Louvain Community 15 in 2004 and 2007
      [Figure: network layouts showing community 5 "SW" (red, left side), "IR" communities 0, 4, 6 and 9 (grey, beige, pink and red, right side, respectively) and their intermediary community 15 (violet)]
    • Conclusion and Future Work I
      • We presented a general and scalable methodology for the analysis of cross-community phenomena, uniquely combining topological and content analysis and supported by special visualization techniques
      • Three community topic evolution measures tailored for identifying phenomena like community shift, shift/merge, and change of topic were proposed and successfully assessed
      • Community shift and topic change were detected quite commonly, which suggests that they are part of many community life-cycles
      • Community shift/merge was detected very rarely, which means either that we have to improve the measure or that this is simply a rare phenomenon
      • We proposed life-cycle measures characterising the states and evolution of communities
    • Conclusion and Future Work II
      • The assessment showed that average vertex betweenness, relative density, cluster content ratio, and topic drift offered valuable insights into the phenomena revealed by the community topic evolution measures
      • We observed strong shifts ($P_S \to 1$) when the shifted community disappeared in the next time-step
        • These strong shifts usually had very different but coherent topics
        • They might have been the initial sources of new topics or even research streams
      • Frequently, a newly emerged community had quite a weak structure (low $\rho$, high $A$) and/or topic (low $T$), and these characteristics then improved in the subsequent time-steps
      • $B$ seems to be a good measure for identifying intermediary communities
    • Conclusion and Future Work III
      • We intend to cluster the community life-cycles by the characteristic events expressed by all the measures
        • We expect this to provide an automated way of extracting life-cycle taxonomies
      • The combination of content and structural analysis allowed us to assess the quality of clusterings obtained only by inspecting the structure of the network
        • We consider this original approach fertile ground for future research
        • We plan to use other algorithms, e.g. a co-clustering algorithm over both content and objects [4]
      • We will extend the whole work to a larger data-set
    • References I
      [1] R. Baeza-Yates, P. Mika, and H. Zaragoza. Search, Web 2.0, and the Semantic Web. IEEE Intelligent Systems, 23(1):80–82, 2008.
      [2] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.
      [3] Georgeta Bordea. Concept Extraction Applied to the Task of Expert Finding. In The Semantic Web: Research and Applications, pages 451–456. Springer, 2010.
    • References II
      [4] Derek Greene and Pádraig Cunningham. Spectral Co-Clustering for Dynamic Bipartite Graphs. Technical report, School of Computer Science & Informatics, UCD, 2010.
      [5] Th. S. Kuhn. The Structure of Scientific Revolutions. University Of Chicago Press, December 1996.
      [6] Martin F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
    • References III
      [7] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences USA, 105:1118–1123, 2008.
      [8] Ken Wakita and Toshiyuki Tsurumi. Finding community structure in a mega-scale social networking service. In IADIS International Conference on WWW/Internet 2007, pages 153–162, 2007.