
Wiki dev nlp







    Presentation Transcript

    • ICSM 2010 - ERA: Analyzing Natural-Language Artifacts of the Software Process. M. Hasan, E. Stroulia, D. Barbosa, M. Alalfi. Oct-23-10, University of Alberta, http://ssrg.cs.ualberta.ca
    • WikiDev2.0 101
      • Objective: to make explicit the elements around software collaboration: artifacts, people and communication
      • Integrate artifacts, people and communications on the web
      • URLs and (hyper)links are the mechanism
      • How it works:
        • Information from SVN, tickets, emails, IRC chats, source code, and code-analysis tools (JDEvAn) is incrementally imported
        • Everything gets a URL
        • Analyses generate hyperlinks between these URLs
        • Interactive views enable the exploration of the original information and the analysis results
    • The WikiDev2.0 Architecture
      • WikiDev2.0 is a test-bed for experiments
      • The question here: what information can we find in text, and how?
      [Architecture diagram. Systems: WikiDev 2.0, Annoki. Component labels: Access Controls, Template Editor, Calendars, LaTeX Convertor, Text Replace, Tagging, Page Feedback, Visualizations, 3D OWL Visualization, Content Analysis, Text analysis, Project lifecycle graphs, Communication graphs, Structural Differencing, Artifact clustering, SVN integration, Bug-tracker integration, Mailing list integration, IRC integration, UML model integration. Databases: custom WikiDev database, custom Annoki database, stock MediaWiki database. Base layers: Annoki control and common functions, common WikiDev functions, stock MediaWiki core.]
    • Text analysis in WikiDev
      • Two text-analysis methods
        • Lexical analysis with TAPoR
        • Syntactic/semantic analysis
      • Five sources of textual data
        • Wiki pages, ticket text, email conversations, IRC chats, SVN comments
      • The underlying model
        [Model diagram. Nodes: developer, programming language, tool, artifact, revision, ticket, task. Edge labels: develop/know/..., work with/cooperate/..., use/work/..., check/commit/..., create/add/..., handle/fix/resolve/..., modify/..., modify/change/...]
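The lexical level of analysis named on this slide can be illustrated with a small sketch. This is not TAPoR and does not use its API; it only shows, in plain Python, the kind of most-frequent-word counting and keyword-in-context lookup that such a toolkit offers. The stop-word list, the function names, and the two sample messages are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "to", "of", "and", "has", "up", "on", "in"}


def most_frequent_words(texts, top_n=3):
    """Count word occurrences across all texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)


def keyword_in_context(text, keyword, width=2):
    """Return each occurrence of keyword with `width` words on either side."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        if word.strip(".,;").lower() == keyword.lower():
            hits.append(" ".join(words[max(0, i - width): i + width + 1]))
    return hits


# Sample messages written in the style of the project data (illustrative only).
messages = [
    "User9 tried to run the Php code",
    "User5 has started learning Php",
]
top_words = most_frequent_words(messages, top_n=1)
contexts = keyword_in_context(messages[0], "php")
print(top_words)
print(contexts)
```

Frequent keywords found this way are the sort of input a dictionary-based annotation step could reuse.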
    • Some interesting sentences
      • User9 tried to run the Php code; User5 has started learning Php.
      • User7 neaten up the UI along the side of UML display; User8 take care of documentation.
      • User5 handled associations in XMIParser; User8 changed wikiroamer and wikiviewfactor.
      • User6 modify User5’s Php file to allow …; User6 and User9 should focus on preparing parser
    • Step 1
      Sentence: I used Java and Eclipse before.
      Syntactic tags: I/PRP used/VBD java/NN and/CC Eclipse/NN before/RB ./.
      Dependency relations:
        nsubj(used-2, I-1)
        dobj(used-2, java-3)
        conj_and(java-3, Eclipse-5)
        advmod(used-2, before-6)
      [Pipeline diagram: WikiDev 2.0 textual data sources → Syntactic Parsing → Sentence Parse Trees → Semantic Annotation → Semantically & Syntactically annotated XMLs → Pattern extraction → RDF Triples. Other labels: TAPoR Analysis, Categories & Wordlist, Dictionary, XQuery Patterns.]
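To work with dependency relations like those on this slide, the textual parser output first has to be turned into structured records. The sketch below assumes the relations arrive as lines of the form `relation(head-index, dependent-index)`, as shown on the slide. The `Dependency` type and `parse_dependencies` function are hypothetical names, not part of WikiDev or of any parser library.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# Matches lines such as "nsubj(used-2, I-1)".
_DEP_PATTERN = re.compile(r"(\w+)\s*\(\s*(\S+)-(\d+)\s*,\s*(\S+)-(\d+)\s*\)")


def parse_dependencies(text):
    """Turn textual dependency relations into Dependency records."""
    return [
        Dependency(rel, head, int(head_idx), dep, int(dep_idx))
        for rel, head, head_idx, dep, dep_idx in _DEP_PATTERN.findall(text)
    ]


# The four relations from the slide's example sentence.
parser_output = """
nsubj(used-2, I-1)
dobj(used-2, java-3)
conj_and(java-3, Eclipse-5)
advmod(used-2, before-6)
"""
deps = parse_dependencies(parser_output)
subject = next(d.dependent for d in deps if d.relation == "nsubj")
for dep in deps:
    print(dep)
```

Keeping the word indices makes it possible to tell apart repeated words in longer sentences.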
    • Step 2
      Users = { User1, User2, ... , User9 }.
      Programming Languages = { Java, PHP, XML, ... }.
      Tools = { Eclipse, Bugzilla, IBMJazz, ... }.
      Tickets = { ticket1, ticket2, ... }.
      Revisions = { revision1, revision2, ... }.
      Action Verbs = { create, debug, implement, fix, make, ... }.
      Project Tasks = { visualization, documentation, user interface, testing, ... }.
      Project Artifacts = { class, method, database, script, ... }.
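The vocabulary above suggests a dictionary lookup for the semantic-annotation step. The following is a minimal sketch of that idea, assuming single-word terms and case-insensitive matching. The tag names `Developer`, `language` and `tool` follow the annotated XML shown in the slides; the remaining tag names and both function names are hypothetical, and only the members listed on the slide are included.

```python
# Vocabulary categories from the slide. Member lists are limited to the
# terms the slide shows; the "..." entries are not filled in.
VOCABULARY = {
    "Developer": {f"user{i}" for i in range(1, 10)},
    "language": {"java", "php", "xml"},
    "tool": {"eclipse", "bugzilla", "ibmjazz"},
    "action": {"create", "debug", "implement", "fix", "make"},  # hypothetical tag name
    "task": {"visualization", "documentation", "testing"},      # hypothetical tag name
    "artifact": {"class", "method", "database", "script"},      # hypothetical tag name
}


def build_lookup(vocabulary):
    """Invert {tag: terms} into {term: tag} for constant-time lookup."""
    return {term: tag for tag, terms in vocabulary.items() for term in terms}


def annotate(tokens, lookup):
    """Pair each token with its semantic tag, or None when it has none."""
    return [(token, lookup.get(token.lower())) for token in tokens]


LOOKUP = build_lookup(VOCABULARY)
tags = annotate(["I", "used", "Java", "and", "Eclipse", "before"], LOOKUP)
print(tags)
```

Multi-word terms such as "user interface" would need phrase matching, which this token-by-token sketch leaves out.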
    • [Chart: frequency of activities (Planning, Communication, Development, Testing, Deployment), one series per month, Month1 to Month4]
      [Chart: per-user counts (user1 to user9) for Month1 to Month4]
    • Step 3
      Annotated XML for the sentence “I used Java and Eclipse before.”
      <S Type="ticket-description" ticketId="1" sentId="127" Author="User1">
        I used Java and Eclipse before.
        <Verb stem="use" ID="1" POS="VBN" Relation="root"> Used
          <PRP ID="1" POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1"> I </PRP>
          <Noun ID="2" POS="NNP" Relation="dobj" semanticTag="language"> Java
            <Noun ID="4" POS="NNP" Relation="conj_and" semanticTag="tool"> Eclipse </Noun>
          </Noun>
          <Adverb ID="5" POS="RB" Relation="advmod"> before </Adverb>
        </Verb>
      </S>
    • Step 4
      Rule 1: Ei, relation, Ej (object)
        Verb "used" (relation)
          nsubj → PRP "I" (Ei)
          dobj → Noun "Java" (Ej)
          aux → Verb "have"
      Rule 2: Ei, relation, Ej (object), with Ek attached to Ej by a noun-modifier relation
        Verb "created" (relation)
          nsubj → PRP "I" (Ei)
          dobj → Noun "program" (Ej)
            noun-modifier → Noun "Java" (Ek)
          aux → Verb "have"
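A subject-relation-object rule of this kind can be sketched over an annotated sentence tree. The slides name XQuery as the pattern language; the example below swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The XML is a trimmed version of the annotated sentence from Step 3, `extract_triples` is a hypothetical name, and treating `conj_and` children as additional objects is an assumption rather than something the slides state.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the annotated sentence "I used Java and Eclipse before."
ANNOTATED = """
<S Type="ticket-description" Author="User1">
  <Verb stem="use" POS="VBN" Relation="root">used
    <PRP POS="PRP" Relation="nsubj" semanticTag="Developer" Name="User1">I</PRP>
    <Noun POS="NNP" Relation="dobj" semanticTag="language">Java
      <Noun POS="NNP" Relation="conj_and" semanticTag="tool">Eclipse</Noun>
    </Noun>
  </Verb>
</S>
"""


def extract_triples(sentence_xml):
    """Emit (subject, verb stem, object) for each verb with nsubj and dobj children."""
    root = ET.fromstring(sentence_xml.strip())
    triples = []
    for verb in root.iter("Verb"):
        subjects = [child for child in verb if child.get("Relation") == "nsubj"]
        objects = [child for child in verb if child.get("Relation") == "dobj"]
        for subj in subjects:
            # Prefer the resolved developer name over the pronoun itself.
            subj_name = subj.get("Name") or subj.text.strip()
            for obj in objects:
                # Assumption: conjuncts of the object count as objects too.
                conjuncts = [c for c in obj if c.get("Relation") == "conj_and"]
                for target in [obj] + conjuncts:
                    triples.append((subj_name, verb.get("stem"), target.text.strip()))
    return triples


triples = extract_triples(ANNOTATED)
print(triples)
```

Using the verb stem rather than the surface form keeps "used", "uses" and "use" under one relation.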
    • Results by data source
      Item                                    SVN comment  Ticket  Email message  IRC chats  Sample triple
      Number of sentences                     353          169     484            3130
      Number of annotated XMLs                346          161     461            3018
      Number of triplets                      154          77      165            830
      <developer, use/..., language>          10           6       15             85         <user6 focus on Java>
      <developer, work/..., tool>             16           4       20             74         <user5 used Eclipse>
      <developer, create/..., artifact>       82           34      55             210        <user5 handle XMIParser>
      <developer, handle/..., task>           12           5       10             68         <user8 work on UI>
      <developer, fix/..., ticket>            4            5       4              16         <user9 fixed bug5>
      <developer, check/..., revision>        5            8       0              3          <user6 done revision53>
      <artifact, change/..., task>            25           12      7              64         <UMLHandler modify Xml>
      <developer, cooperate/..., developer>   0            3       56             310        <user6 & user17 focus on parser>
    • Empirical Evaluation
      Triples found
        Developer (D)      54
        The tool           39
      Triples missed       28   52% (out of the 54 identified by D)
      Correct triples      19   49% (out of the 39 found)
      Incorrect triples    20   51% (out of the 39 found)
      Missed triples       3
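The percentages on this slide follow directly from the counts, as the short calculation below shows. The variable names are chosen for illustration. The share of correct triples among those the tool found is what is often called precision, although the slide does not use that term.

```python
# Counts taken from the evaluation slide.
developer_triples = 54   # triples identified by the developer (D)
tool_triples = 39        # triples found by the tool
correct = 19             # tool triples judged correct
incorrect = 20           # tool triples judged incorrect
missed = 28              # developer triples the tool did not find


def percent(part, whole):
    """Share of `part` in `whole`, rounded to a whole percentage."""
    return round(100 * part / whole)


correct_pct = percent(correct, tool_triples)
incorrect_pct = percent(incorrect, tool_triples)
missed_pct = percent(missed, developer_triples)
print(correct_pct, incorrect_pct, missed_pct)
```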
    • Conclusions
      • There is substantial information in the text associated with the software process
        • Developer experience
        • Decision rationale
        • Problems and solutions considered
      • We developed a method for lexical, syntactic, and semantic analysis of textual data produced during the life-cycle of a software project.
      • The empirical analysis shows that
        • Interesting data can be extracted
        • More robust parsing and better entity-phrase recognition are necessary
      • We are currently working towards (a) extending the suite of RDF-triple patterns, and (b) developing a domain-specific query language for flexible question answering on the project lifecycle.
    • And the Poster!
      Analyzing Natural-Language Artifacts of the Software Process
      Maryam Hasan, Eleni Stroulia, Denilson Barbosa, and Manar Alalfi
      Department of Computing Science, University of Alberta
      http://ssrg.cs.ualberta.ca/index.php/WikiDev_2.0

      Motivation
      • Software developers generate a substantial stream of textual data as they communicate during the life-cycle of their projects.
      • Through emails and chats, developers discuss the requirements of their software system, negotiate the distribution of tasks among them, and make decisions about the system design and the internal structure and functionalities of its code modules.
      • A careful analysis of such text reveals valuable information about various aspects of the software life-cycle.

      Methodology
      • We applied two complementary text-analysis methods to examine five different sources of textual data of a team project and gain valuable information about various aspects of the software life-cycle. The five textual-data sources are (a) wiki pages, (b) SVN comments, (c) tickets, (d) email messages, and (e) IRC chats.
      • The first method is based on approximate and efficient analysis, at the lexical level, using the off-the-shelf lexical-analysis toolkit TAPoR.
      • The second is much more accurate, albeit more expensive from a computational point of view, and focuses on the text at the syntactic and semantic level.
      [Figure 1: Tool Architecture. WikiDev 2.0 textual data → Syntactic Parsing → Parse Trees → Semantic Annotation → Annotated XMLs → Pattern Extraction → RDF Triples. Other labels: TAPoR Analysis, Categories & Dictionary, XQuery Patterns.]

      TAPoR Analysis
      TAPoR (Text Analysis Portal for Research) is a web-based application that supports a suite of lexical text-analysis tools, including word counts, word co-occurrence, word-cloud visualizations, word collocations, and pattern extraction. In our work, we used TAPoR for two purposes:
      • We used the “most-frequent-words” functionality to identify important keywords, which are then used by the syntactic/semantic analysis method.
      • We applied the word-count and keyword-in-context services to gain insights about interesting trends in the information contained in the different data sources, as provided by the team members over the various stages of the project.

      Syntactic/Semantic Analysis
      Syntactic/semantic analysis integrates computational-linguistic techniques with domain-specific knowledge in order to extract useful pieces of information as RDF triples. This method consists of the following stages:
      • Syntactic parsing: assigns a syntactic tag to each word and identifies the grammatical relationships between word pairs.
        [Figure 2: Syntax parse tree. Use (Verb) with nsubj → I (PRP), aux → Have (Verb), dobj → Java (Noun).]
      • Semantic annotation: assigns terms semantic tags from a domain-specific vocabulary. The vocabulary created for this domain includes these categories: Users, Programming Languages, Tools, Project Tasks, Project Artifacts, Tickets, Revisions.
        [Figure 3: Semantically annotated tree. The same tree, with I (PRP) annotated name = User1, STag = developer, and Java (Noun) annotated STag = Lang.]
      • Pattern extraction: extracts subject-predicate-object patterns from annotated XMLs using XQuery to retrieve interesting RDF triples. The extracted triples constitute an instance of a rich conceptual model of the domain that captures interesting relations between developers and software products.
        [Figure 4: Semantic relations of the RDF triples created by pattern extraction. Nodes: Developer, Programming Language, Tool, Artifact, Revision, Ticket, Task. Edge labels: Develop/Write/.., Use/.., Cooperate/Know/Work with/.., Create/Commit/Add/.., Check/.., Resolve/Handle/Change/Fix/.., Work/.., Modify/..]

      Syntactic/Semantic Analysis Results
      From the triples extracted by syntactic/semantic analysis we can find useful pieces of information, such as the following:
      • Expertise of developers: <User5 started learning PHP>
      • Responsibility of developers: <User8 do documentation>
      • Developers’ contribution: <User5 handle associations in XMIParser>
      • Developers’ relationships: <User6 modify User5’s PHP file>

      Table 1: Summary of the results of syntactic/semantic analysis.
      Item                                    SVN comment  Ticket  Email message  IRC chat  Sample triple
      Number of sentences                     353          169     484            3130
      Number of annotated XMLs                346          161     461            3018
      Number of triplets                      154          77      165            830
      <developer, use/..., language>          10           6       15             85        <user6 focus on Java>
      <developer, work/..., tool>             16           4       20             74        <user5 used Eclipse>
      <developer, create/..., artifact>       82           34      55             210       <user5 handle XMIParser>
      <developer, handle/..., task>           12           5       10             68        <user8 work on UI>
      <developer, fix/..., ticket>            4            5       4              16        <user9 fixed bug5>
      <developer, check/..., revision>        5            8       0              3         <user6 done revision53>
      <artifact, change/..., task>            25           12      7              64        <UMLHandler modify Xml>
      <developer, cooperate/..., developer>   0            3       56             310       <user6 & user17 focus on parser>

      TAPoR Analysis Results
      [Figure 5: Trend of team activities throughout the project life-cycle]
      [Figure 6: Trend of team members' communication throughout the project life-cycle]

      Conclusion & Future Work
      • The conversations among the team developers in email messages, IRC chats, SVN commit messages, ticket descriptions and wiki pages contain valuable information about their activities and artifacts, the issues the team members faced during their work, and the decisions they made.
      • We are currently working towards:
        • Extending the suite of RDF-triple patterns.
        • Developing a domain-specific query language (based on our underlying conceptual model of Figure 4) for flexible question answering on the project lifecycle.
        • Running the experiment on a bigger dataset and evaluating the results.

      Acknowledgments
      The authors wish to thank Marios Fokaefs and Ken Bauer for their help with the experiments.