WikiDev NLP



  1. Analyzing Natural-Language Artifacts of the Software Process
     M. Hasan, E. Stroulia, D. Barbosa, M. Alalfi
     University of Alberta, http://ssrg.cs.ualberta.ca
     Oct-23-10, ICSM 2010 - ERA
  2. WikiDev2.0 101
     Objective: to make explicit the elements of software collaboration: artifacts, people, and communication.
     • Integrates artifacts, people, and communications on the web
     • URLs and (hyper)links are the mechanism
     How it works:
     • Information from SVN, tickets, emails, IRC chats, source code, and code-analysis tools (JDEvAn) is incrementally imported
     • Everything gets a URL
     • Analyses generate hyperlinks between these URLs
     • Interactive views enable exploration of the original information and analysis results
  3. The WikiDev2.0 Architecture
     [Architecture diagram: the stock MediaWiki core and database; the Annoki layer (access controls, content analysis, page feedback, LaTeX convertor, template editor, calendars, visualizations, text replace, tagging, common functions) with a custom Annoki database; and the WikiDev 2.0 layer (SVN, bug-tracker, mailing-list, and IRC integration; UML model integration; structural differencing; text analysis; artifact clustering; project lifecycle graphs; communication graphs; 3D OWL visualization; common WikiDev functions) with a custom WikiDev database]
     WikiDev2.0 is a test-bed for experiments.
     The question here: what information can we find in text, and how?
  4. Text Analysis in WikiDev
     • Two text-analysis methods:
       • lexical analysis with TAPoR
       • syntactic/semantic analysis
     • The underlying model
     • Five sources of textual data:
       • wiki pages, ticket text, email conversations, IRC chats, SVN comments
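The core operation of the lexical method, counting frequent words across the textual sources, can be sketched in a few lines. This is a stand-in for TAPoR's most-frequent-words service, not its actual API, and the corpus below is invented for illustration:

```python
from collections import Counter
import re

STOPWORDS = frozenset({"the", "a", "an", "to", "and", "in", "for", "now"})

def most_frequent_words(texts, top_n=5):
    """Count word frequencies across several text sources and return the
    top_n (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)

# Invented corpus standing in for SVN comments, ticket text, and IRC chats.
corpus = [
    "fix the parser bug in XMIParser",
    "the parser handles associations now",
    "documentation for the parser module",
]
print(most_frequent_words(corpus, top_n=3))
```

Frequent terms surfaced this way (here, "parser") are the kind of keywords that seed the dictionary used by the syntactic/semantic method.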
  5. Some Interesting Sentences
     • User9 tried to run the Php code; User5 has started learning Php.
     • User7 neaten up the UI along the side of UML display; User8 take care of documentation.
     • User5 handled associations in XMIParser; User8 changed wikiroamer and wikiviewfactor.
     • User6 modify User5’s Php file to allow …; User6 and User9 should focus on preparing parser
  6. Step 1: Syntactic Parsing
     [Pipeline figure: WikiDev 2.0 textual data sources → syntactic parsing → parse trees → semantic annotation (categories & wordlist dictionary, fed by the TAPoR analysis) → semantically & syntactically annotated XMLs → pattern extraction (XQuery patterns) → RDF triples]
     Sentence: I used Java and Eclipse before.
     Syntactic tags: I/PRP used/VBD java/NN and/CC Eclipse/NN before/RB ./.
     Dependency relations:
       nsubj(used-2, I-1)
       dobj(used-2, java-3)
       conj_and(java-3, Eclipse-5)
       advmod(used-2, before-6)
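A parser emits dependency relations like these as plain strings; a small helper (illustrative, not part of the authors' tool) can turn Stanford-style relations such as nsubj(used-2, I-1) into structured tuples for downstream processing:

```python
import re

# Matches e.g. "nsubj(used-2, I-1)" -> relation, head word/index, dependent word/index.
DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+),\s*(\S+)-(\d+)\)")

def parse_dependency(dep_str):
    """Parse one Stanford-style dependency string into a
    (relation, head, head_index, dependent, dependent_index) tuple."""
    m = DEP_RE.match(dep_str)
    if not m:
        raise ValueError(f"not a dependency relation: {dep_str}")
    rel, head, head_idx, dep, dep_idx = m.groups()
    return (rel, head, int(head_idx), dep, int(dep_idx))

deps = ["nsubj(used-2, I-1)", "dobj(used-2, java-3)",
        "conj_and(java-3, Eclipse-5)", "advmod(used-2, before-6)"]
parsed = [parse_dependency(d) for d in deps]
print(parsed[0])  # ('nsubj', 'used', 2, 'I', 1)
```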
  7. Step 2: Semantic Categories
     [Pipeline figure as on the previous slide]
     Users = { User1, User2, ..., User9 }
     Programming Languages = { Java, PHP, XML, ... }
     Tools = { Eclipse, Bugzilla, IBMJazz, ... }
     Tickets = { ticket1, ticket2, ... }
     Revisions = { revision1, revision2, ... }
     Action Verbs = { create, debug, implement, fix, make, ... }
     Project Tasks = { visualization, documentation, user interface, testing, ... }
     Project Artifacts = { class, method, database, script, ... }
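Semantic annotation against these categories is, at its core, a dictionary lookup. A minimal sketch, with a small illustrative subset of the vocabulary (the real dictionary would list every user, language, tool, and so on):

```python
# Illustrative subset of the slide's domain vocabulary.
CATEGORIES = {
    "user1": "Users", "user5": "Users", "user9": "Users",
    "java": "Programming Languages", "php": "Programming Languages",
    "eclipse": "Tools", "bugzilla": "Tools",
    "create": "Action Verbs", "fix": "Action Verbs", "use": "Action Verbs",
    "documentation": "Project Tasks", "testing": "Project Tasks",
    "class": "Project Artifacts", "method": "Project Artifacts",
}

def annotate(tokens):
    """Pair each token with its semantic category via dictionary lookup;
    tokens outside the vocabulary map to None."""
    return [(tok, CATEGORIES.get(tok.lower())) for tok in tokens]

# "used" maps to None here: verbs need stemming to "use" before lookup.
print(annotate(["I", "used", "Java", "and", "Eclipse", "before"]))
```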
  8. [Bar charts: frequency of team activities (planning, communication, development, testing, deployment) across Month1–Month4, and each team member's (user1–user9) monthly contribution across Month1–Month4]
  9. Step 3: Annotated XML
     [Pipeline figure as on the previous slides]
     Annotated XML for the sentence "I used Java and Eclipse before.":
       <S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
         <Verb stem="use" ID="1" POS="VBN" Relation="root"> used
           <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
           <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java </Noun>
           <Noun ID="4" POS="NNP" Relation="conj_and" semanticTag="tool"> Eclipse </Noun>
           <Adverb ID="5" POS="RB" Relation="advmod"> before </Adverb>
         </Verb>
       </S>
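Documents in this annotated-XML format can be produced programmatically. A minimal sketch with Python's xml.etree.ElementTree, following the element and attribute names in the slide's example (how the authors actually serialize their XML is not specified):

```python
import xml.etree.ElementTree as ET

# Build the annotated XML for "I used Java and Eclipse before."
s = ET.Element("S", Type="ticket-description", ticketId="1",
               sentId="127", Author="User1")
verb = ET.SubElement(s, "Verb", stem="use", ID="1", POS="VBN", Relation="root")
verb.text = "used"
ET.SubElement(verb, "PRP", ID="1", POS="PRP", Relation="nsubj",
              semanticTag="Developer", Name="User1").text = "I"
ET.SubElement(verb, "Noun", ID="2", POS="NNP", Relation="dobj",
              semanticTag="language").text = "Java"
ET.SubElement(verb, "Noun", ID="4", POS="NNP", Relation="conj_and",
              semanticTag="tool").text = "Eclipse"
ET.SubElement(verb, "Adverb", ID="5", POS="RB", Relation="advmod").text = "before"

xml_str = ET.tostring(s, encoding="unicode")
print(xml_str)
```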
  10. Step 4: Pattern Extraction
      [Pipeline figure as on the previous slides]
      [Two extraction rules, shown as annotated parse trees:
        Rule 1: an (Ei, relation, Ej) triple, where entity Ej is the verb's object;
        Rule 2: an (Ei, relation, Ek) triple, where entity Ek is attached to the object Ej by a noun-modifier relation]
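The Rule 1 pattern (subject, verb, object) can be sketched over the annotated XML. ElementTree stands in here for the XQuery patterns the tool actually uses, and the input document is a trimmed version of the Step 3 example:

```python
import xml.etree.ElementTree as ET

# A simplified annotated sentence (same attribute names as the Step 3
# example, trimmed for brevity).
xml_doc = """<S Author="User1">
  <Verb stem="use" Relation="root">
    <PRP Relation="nsubj" semanticTag="Developer" Name="User1">I</PRP>
    <Noun Relation="dobj" semanticTag="language">Java</Noun>
  </Verb>
</S>"""

def extract_triples(xml_text):
    """Apply Rule 1: pair each verb's nsubj dependent with its dobj
    dependent to form a (subject, verb-stem, object) triple."""
    triples = []
    root = ET.fromstring(xml_text)
    for verb in root.iter("Verb"):
        subjects = [c for c in verb if c.get("Relation") == "nsubj"]
        objects = [c for c in verb if c.get("Relation") == "dobj"]
        for subj in subjects:
            for obj in objects:
                # Prefer the resolved entity name (e.g. User1) over the pronoun.
                triples.append((subj.get("Name") or subj.text.strip(),
                                verb.get("stem"), obj.text.strip()))
    return triples

print(extract_triples(xml_doc))  # [('User1', 'use', 'Java')]
```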
  11. Syntactic/Semantic Analysis Results
      |                                       | SVN commit | Ticket | Email message | IRC chat | Sample Triple |
      | Number of Sentences                   | 353 | 169 | 484 | 3130 | |
      | Number of annotated XMLs              | 346 | 161 | 461 | 3018 | |
      | Number of Triplets                    | 154 | 77  | 165 | 830  | |
      | <developer, use/..., language>        | 10  | 6   | 15  | 85   | <user6 focus on Java> |
      | <developer, work/..., tool>           | 16  | 4   | 20  | 74   | <user5 used Eclipse> |
      | <developer, create/..., artifact>     | 82  | 34  | 55  | 210  | <user5 handle XMIParser> |
      | <developer, handle/..., task>         | 12  | 5   | 10  | 68   | <user8 work on UI> |
      | <developer, fix/..., ticket>          | 4   | 5   | 4   | 16   | <user9 fixed bug5> |
      | <developer, check/..., revision>      | 5   | 8   | 0   | 3    | <user6 done revision53> |
      | <artifact, change/..., task>          | 25  | 12  | 7   | 64   | <UMLHandler modify Xml> |
      | <developer, cooperate/..., developer> | 0   | 3   | 56  | 310  | <user6 & user17 focus on parser> |
  12. Empirical Evaluation
      |                   | Developer (D) | The tool |
      | Triples found     | 54            | 39       |
      | Triples missed    | 3             | 28 (52% of the 54 identified by D) |
      | Correct triples   |               | 19 (49% of the 39 found) |
      | Incorrect triples |               | 20 (51% of the 39 found) |
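The slide reports raw counts; treating the developer's 54 triples as the gold standard, the implied precision and recall follow directly. These derived figures are ours, not the slide's:

```python
# Raw counts from the evaluation slide.
developer_triples = 54  # triples identified by the developer (gold standard)
tool_triples = 39       # triples found by the tool
correct = 19            # tool triples judged correct

precision = correct / tool_triples      # 19/39, the slide's "49% correct"
recall = correct / developer_triples    # 19/54, not stated on the slide
print(f"precision={precision:.2f} recall={recall:.2f}")
```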
  13. Conclusions
      • There is substantial information in the text associated with the software process:
        • developer experience
        • decision rationale
        • problems and solutions considered
      • We developed a method for lexical, syntactic, and semantic analysis of textual data produced during the life-cycle of a software project.
      • The empirical analysis shows that:
        • interesting data can be extracted;
        • more robust parsing and better entity-phrase recognition are necessary.
      • We are currently working towards (a) extending the suite of RDF-triple patterns, and (b) developing a domain-specific query language for flexible question answering on the project lifecycle.
  14. And the Poster!

      Analyzing Natural-Language Artifacts of the Software Process
      Maryam Hasan, Eleni Stroulia, Denilson Barbosa, and Manar Alalfi
      Department of Computing Science, University of Alberta

      Motivation
      • Software developers generate a substantial stream of textual data as they communicate during the life-cycle of their projects.
      • Through emails and chats, developers discuss the requirements of their software system, negotiate the distribution of tasks among themselves, and make decisions about the system design and the internal structure and functionality of its code modules.
      • A careful analysis of such text reveals valuable information about various aspects of the software life-cycle.

      Methodology
      • We applied two complementary text-analysis methods to examine five different sources of textual data of a team project and gain valuable information about various aspects of the software life-cycle. The five textual-data sources are (a) wiki pages, (b) SVN comments, (c) tickets, (d) email messages, and (e) IRC chats.
      • The first method is an approximate and efficient analysis at the lexical level, using the off-the-shelf lexical-analysis toolkit TAPoR.
      • The second is much more accurate, albeit computationally more expensive, and focuses on the text at the syntactic and semantic levels.
      [Figure 1: Tool architecture. WikiDev 2.0 textual data → syntactic parsing → parse trees → semantic annotation (driven by categories & dictionary from the TAPoR analysis) → annotated XMLs → pattern extraction (XQuery patterns) → RDF triples.]

      TAPoR Analysis
      TAPoR (Text Analysis Portal for Research) is a web-based application supporting a suite of lexical text-analysis tools, including word counts, word co-occurrence, word-cloud visualizations, word collocations, and pattern extraction. In our work, we used TAPoR for two purposes:
      • We used the "most-frequent-words" functionality to identify important keywords, which are then used by the syntactic/semantic analysis method.
      • We applied the word-count and keyword-in-context services to gain insights about interesting trends in the information contributed by the team members to the different data sources over the various stages of the project.

      Syntactic/Semantic Analysis
      Syntactic/semantic analysis integrates computational-linguistic techniques with domain-specific knowledge in order to extract useful pieces of information as RDF triples. This method consists of the following stages:
      • Syntactic parsing: assigns a syntactic tag to each word and identifies the grammatical relationships between word pairs.
      • Semantic annotation: assigns terms semantic tags from a domain-specific vocabulary. The vocabulary created for this domain includes these categories: Users, Programming Languages, Tools, Project Tasks, Project Artifacts, Tickets, Revisions.
      • Pattern extraction: extracts subject-predicate-object patterns from the annotated XMLs, using XQuery to retrieve interesting RDF triples.
      The extracted triples constitute an instance of a rich conceptual model of the domain that captures interesting relations between developers and software products.
      [Figure 2: Syntax parse tree for "I used Java": use (Verb) with dependents I (PRP, nsubj), Java (Noun, dobj), and have (Verb, aux).]
      [Figure 3: The same tree, semantically annotated: I carries name=User1, STag=developer; Java carries STag=Lang.]
      [Figure 4: Semantic relations of the RDF triples created by pattern extraction: Developer cooperates/works with Developer; knows/uses Programming Language; works with Tool; creates/adds Artifact; develops/writes Task; handles/works on Task; resolves/fixes Ticket; commits/checks Revision; Artifact changes/modifies Task.]

      TAPoR Analysis Results
      [Figure 5: Trend of team activities throughout the project life-cycle.]
      [Figure 6: Trend of team members' communication throughout the project life-cycle.]

      Syntactic/Semantic Analysis Results
      Table 1: Summary of the results of the syntactic/semantic analysis.
      |                                       | SVN commit | Ticket | Email message | IRC chat | Sample Triple |
      | Number of Sentences                   | 353 | 169 | 484 | 3130 | |
      | Number of annotated XMLs              | 346 | 161 | 461 | 3018 | |
      | Number of Triplets                    | 154 | 77  | 165 | 830  | |
      | <developer, use/..., language>        | 10  | 6   | 15  | 85   | <user6 focus on Java> |
      | <developer, work/..., tool>           | 16  | 4   | 20  | 74   | <user5 used Eclipse> |
      | <developer, create/..., artifact>     | 82  | 34  | 55  | 210  | <user5 handle XMIParser> |
      | <developer, handle/..., task>         | 12  | 5   | 10  | 68   | <user8 work on UI> |
      | <developer, fix/..., ticket>          | 4   | 5   | 4   | 16   | <user9 fixed bug5> |
      | <developer, check/..., revision>      | 5   | 8   | 0   | 3    | <user6 done revision53> |
      | <artifact, change/..., task>          | 25  | 12  | 7   | 64   | <UMLHandler modify Xml> |
      | <developer, cooperate/..., developer> | 0   | 3   | 56  | 310  | <user6 & user17 focus on parser> |

      From the triples extracted by the syntactic/semantic analysis we can find useful pieces of information, such as the following:
      • Expertise of developers: <User5 started learning PHP>
      • Responsibility of developers: <User8 do documentation>
      • Developers' contributions: <User5 handle associations in XMIParser>
      • Developers' relationships: <User6 modify User5's PHP file>

      Conclusion & Future Work
      • The conversations among the team developers in email messages, IRC chats, SVN commit messages, ticket descriptions, and wiki pages contain valuable information about their activities and artifacts, the issues the team members faced during their work, and the decisions they made.
      • We are currently working towards:
        • extending the suite of RDF-triple patterns;
        • developing a domain-specific query language (based on our underlying conceptual model of Figure 4) for flexible question answering on the project lifecycle;
        • running the experiment on a bigger dataset and evaluating the results.

      Acknowledgments
      The authors wish to thank Marios Fokaefs and Ken Bauer for their help with the experiments.