130919 jim cordy - when is a clone not a clone


Software clone, detection, empirical studies, validation

Transcript

  • 1. When is a Clone not a Clone? (and vice-versa): Contextualized Analysis of Web Services
    Douglas Martin, Scott Grant, James R. Cordy, David B. Skillicorn
    School of Computing, Kingston, Canada
  • 2. Motivation
    - The Personal Web
        - Rapidly growing number of web services makes it increasingly difficult to find and choose the right ones
        - Need a quick and convenient way to find alternatives
        - Hand tagging is impractical – automation is needed!
  • 3. Motivation
    - Automation
        - Similarity detection techniques offer solutions!
        - Code clone detection from software engineering research can find similar code fragments – why not similar services?
        - Topic models from data mining research can find text documents with similar semantics – why not similar services?
  • 4. Web Service Similarity
    - Web services are stored in service registries, containing WSDL service description files
    - Could apply clone detection to entire service descriptions
    - But what we really want are similar service operations
  • 5. Let's try it!
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="Supplier" type="xsd:string"/>
            <element name="Warehouse" type="xsd:string"/>
            <element name="OnHand" type="xsd:string"/>
            <element name="OnOrder" type="xsd:string"/>
            <element name="Demand" type="xsd:string"/>
          </sequence>
        </complexType>
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="date" type="xsd:string"/>
            <element name="open" type="xsd:float"/>
            <element name="high" type="xsd:float"/>
            <element name="low" type="xsd:float"/>
            <element name="close" type="xsd:float"/>
            <element name="volume" type="xsd:float"/>
          </sequence>
        </complexType>
  • 6. How about these?
        <operation name="DrawRateChartCustom">
          <input message="DrawRateChartCustomIn"/>
          <output message="DrawRateChartCustomOut"/>
        </operation>
        <operation name="GetTopicBinaryChartCustom">
          <input message="GetTopicBinaryChartCustomSoapIn"/>
          <output message="GetTopicBinaryChartCustomSoapOut"/>
        </operation>
  • 7. So what went wrong?
    - At this point we thought maybe our idea wasn't going to work
    - Maybe clone detection can't help with web service discovery?
    - But why? What's so special about WSDL?
  • 11. Web Service Description Language (WSDL)
    - A WSDL service description has 3 main parts:
        - a <portType> element where the operations are declared;
        - <message> elements corresponding to inputs, outputs and faults of the operations;
        - and a <types> element containing an XML Schema that defines the data and structure types used in the messages
  • 14. Web Service Description Language (WSDL)
    - This simple example service has two operations:
        - ReserveRoom
        - GetAvailableRooms
  • 15. Web Service Description Language (WSDL)
    - WSDL service description files contain descriptions of the operations that a web service has to offer
    - But the pieces of each operation's own description are scattered over different parts of the WSDL file
    - Difficult to identify complete units to analyze and compare
  • 16. The Problem
    - This poses a problem for analysis techniques:
        - Operations cannot easily be compared for similarity using clone detectors, because there are no contiguous fragments to compare
        - And they cannot be analyzed using data mining topic models, because there are no separate complete documents to generate a model from
  • 17. Our Solution
    - Our solution is to contextualize the original <operation> elements, to create self-contained operation descriptions
    - We use source transformation to inline remote information from the context into the elements that reference or depend on it
    - We call these contextualized WSDL operations Web Service Cells, or WSCells
    - The first example of a new kind of clone detection: contextual clones
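The inlining step described on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' transformation: it uses Python's standard xml.etree.ElementTree rather than a source transformation system, it assumes WSDL 1.1 namespaces, it follows references only one level deep (operation to message to type), and the names contextualize, local and SAMPLE are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: WSDL 1.1 and XML Schema namespaces.
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "xsd": "http://www.w3.org/2001/XMLSchema",
}


def local(qname):
    """Strip a namespace prefix such as 'tns:' from a QName string."""
    return qname.split(":")[-1]


def contextualize(wsdl_text):
    """Return one self-contained element per <operation> (a rough WSCell)."""
    root = ET.fromstring(wsdl_text)
    # Index the scattered pieces by name so operations can pull them in.
    messages = {m.get("name"): m for m in root.iterfind("wsdl:message", NS)}
    types = {}
    for schema in root.iterfind("wsdl:types/xsd:schema", NS):
        for decl in schema:
            if decl.get("name"):
                types[decl.get("name")] = decl
    cells = []
    for op in root.iterfind("wsdl:portType/wsdl:operation", NS):
        cell = ET.Element("operation", name=op.get("name"))
        for io in op:  # <input>, <output> and <fault> children
            role = ET.SubElement(cell, io.tag.split("}")[-1])
            msg = messages.get(local(io.get("message", "")))
            if msg is None:
                continue
            for part in msg.iterfind("wsdl:part", NS):
                ref = part.get("element") or part.get("type") or ""
                decl = types.get(local(ref))
                if decl is not None:
                    # Inline a copy of the referenced type declaration.
                    role.append(copy.deepcopy(decl))
        cells.append(cell)
    return cells


# Hypothetical miniature service description, for illustration only.
SAMPLE = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:tns="urn:example">
  <types>
    <xsd:schema>
      <xsd:complexType name="Stock">
        <xsd:sequence>
          <xsd:element name="Supplier" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
  </types>
  <message name="GetStockRequest"/>
  <message name="GetStockResponse">
    <part name="body" type="tns:Stock"/>
  </message>
  <portType name="StockPort">
    <operation name="GetStock">
      <input message="tns:GetStockRequest"/>
      <output message="tns:GetStockResponse"/>
    </operation>
  </portType>
</definitions>"""

if __name__ == "__main__":
    for cell in contextualize(SAMPLE):
        print(ET.tostring(cell, encoding="unicode"))
```

Each returned element carries its own message and type details, so two operations that share a name but use different types no longer look identical to a line-based comparison.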
  • 18. Contextualizing WSDL Operations
  • 19. Contextual Clone Detection
  • 20. An Experiment
    - We ran an experiment to investigate the difference between clone detection on WSCells and on the original raw operations
    - Two sets of WSDL service description files: 1,100 operations and 7,500 operations
    - Compared NICAD clone detector results for each set at various near-miss difference thresholds (0% = exact clone, 10% = 1 line in 10 different, and so on)
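The near-miss threshold on this slide can be illustrated with a small line-based comparison. The sketch below is an assumption about how such a threshold might be computed (longest common subsequence over lines, measured against the larger fragment); it is not NICAD's implementation, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists of lines."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def difference(a, b):
    """Fraction of lines in the larger fragment with no match in the other."""
    if not a and not b:
        return 0.0
    return 1.0 - lcs_length(a, b) / max(len(a), len(b))


def is_clone(a, b, threshold):
    """True if the fragments differ by no more than the given fraction.

    The small tolerance guards against floating-point rounding at the boundary.
    """
    return difference(a, b) <= threshold + 1e-9


if __name__ == "__main__":
    original = [f"line {i}" for i in range(10)]
    edited = list(original)
    edited[4] = "changed line"  # 1 line in 10 differs
    print(is_clone(original, edited, 0.0))  # exact clones only
    print(is_clone(original, edited, 0.1))  # near-miss at 10%
```

With a threshold of 0.0 only identical fragments match; at 0.1 a ten-line fragment with one changed line still counts as a clone.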
  • 21. An Experiment
    - Number of clones decreases with WSCells
        Difference    Clone Pairs in Set 1      Clone Pairs in Set 2
        Threshold     Originals    WSCells      Originals    WSCells
        0.0           852          705          1434         1066
        0.1           852          734          1434         1228
        0.2           879          775          1438         1637
        0.3           884          813          1469         1637
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="Supplier" type="xsd:string"/>
            <element name="Warehouse" type="xsd:string"/>
            <element name="OnHand" type="xsd:string"/>
            <element name="OnOrder" type="xsd:string"/>
            <element name="Demand" type="xsd:string"/>
          </sequence>
        </complexType>
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="date" type="xsd:string"/>
            <element name="open" type="xsd:float"/>
            <element name="high" type="xsd:float"/>
            <element name="low" type="xsd:float"/>
            <element name="close" type="xsd:float"/>
            <element name="volume" type="xsd:float"/>
          </sequence>
        </complexType>
    - Reduction in false positives
  • 22. An Experiment
    - Number of clone classes can increase with WSCells
        Difference    Clone Classes in Set 1    Clone Classes in Set 2
        Threshold     Originals    WSCells      Originals    WSCells
        0.0           169          187          587          433
        0.1           169          139          587          499
        0.2           172          142          589          631
        0.3           171          136          591          631
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="Supplier" type="xsd:string"/>
            <element name="Warehouse" type="xsd:string"/>
            <element name="OnHand" type="xsd:string"/>
            <element name="OnOrder" type="xsd:string"/>
            <element name="Demand" type="xsd:string"/>
          </sequence>
        </complexType>
        <operation name="GetStock">
          <input message="tns:GetStockRequest"/>
          <output message="tns:GetStockResponse"/>
        </operation>
        <complexType name="Stock">
          <sequence>
            <element name="date" type="xsd:string"/>
            <element name="open" type="xsd:float"/>
            <element name="high" type="xsd:float"/>
            <element name="low" type="xsd:float"/>
            <element name="close" type="xsd:float"/>
            <element name="volume" type="xsd:float"/>
          </sequence>
        </complexType>
    - Splits by deeper differences – more precision
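Why the number of clone classes can rise while the number of clone pairs falls is easier to see with a toy example. The sketch below assumes a clone class is a connected group of clone pairs, which is a simplification and may not match how NICAD forms its classes; the fragment names and the clone_classes function are hypothetical.

```python
def clone_classes(pairs):
    """Group clone pairs into classes (connected components, via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    # Hypothetical fragments: a chain of three pairs forms a single class.
    raw_pairs = [("a", "b"), ("b", "c"), ("c", "d")]
    # Dropping one spurious pair leaves fewer pairs but more classes.
    refined_pairs = [("a", "b"), ("c", "d")]
    print(len(raw_pairs), len(clone_classes(raw_pairs)))
    print(len(refined_pairs), len(clone_classes(refined_pairs)))
```

In this toy case, removing the pair that linked two unrelated groups lowers the pair count from three to two and raises the class count from one to two, which is the kind of split by deeper differences the slide describes.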
  • 23. Clone Detection for Web Services
    - Contextual clone detection with WSCells works!
    - Not only finds similar web service operations, but uncovers similar operations we could not find in any other way
        <operation name="DrawRateChartCustom">
          <input message="DrawRateChartCustomIn"/>
          <output message="DrawRateChartCustomOut"/>
        </operation>
        <operation name="GetRealChartCustom">
          <input message="GetRealChartCustomSoapIn"/>
          <output message="GetRealChartCustomSoapOut"/>
        </operation>
        <operation name="GetLastSaleChartCustom">
          <input message="GetLastSaleChartCustomSoapIn"/>
          <output message="GetLastSaleChartCustomSoapOut"/>
        </operation>
        <operation name="DrawYieldCurveCustom">
          <input message="DrawYieldCurveCustomIn"/>
          <output message="DrawYieldCurveCustomOut"/>
        </operation>
        <operation name="GetTopicChartCustom">
          <input message="GetTopicChartCustomSoapIn"/>
          <output message="GetTopicChartCustomSoapOut"/>
        </operation>
        <operation name="GetTopicBinaryChartCustom">
          <input message="GetTopicBinaryChartCustomSoapIn"/>
          <output message="GetTopicBinaryChartCustomSoapOut"/>
        </operation>
  • 24. Semantic Analysis of Web Services
    - Contextualized WSCells also make it possible to use data mining topic models to do semantic analysis of web services
        - Because they provide self-contained documents of significant size
    - Might topic models provide a different view of web service similarity?
  • 25. Latent Dirichlet Allocation
    - Latent Dirichlet Allocation (LDA):
        - A statistical model to uncover latent topics
        - Identifies the correlation between documents in terms of shared latent topics (sets of tokens)
        - Accepts a set of documents (e.g., source files) as input, returns probability distributions over inferred topics (a topic model) as output
        - Each document has some probability of being related to topic 1, another probability for topic 2, and so on
        - Similar documents should be related to similar topics
  • 26. Latent Dirichlet Allocation
    - Documents are represented in the model in terms of probability distributions over topics
    - Similarity between documents is found using the Hellinger Distance
        - A measure of how much agreement there is between the shared topics of two documents
        - Almost identical documents have a small Hellinger Distance since they will be related to the same topics
        - In terms of web services, small Hellinger Distances indicate highly related operations
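For reference, the Hellinger distance between two discrete probability distributions can be computed as in the short sketch below. The formula is the standard one; the topic distributions in the usage section are made-up numbers for illustration, not values from the talk.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions that share no topic at all.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


if __name__ == "__main__":
    # Hypothetical per-document topic probabilities (each sums to 1).
    doc_a = [0.8, 0.1, 0.1]
    doc_b = [0.7, 0.2, 0.1]
    doc_c = [0.1, 0.1, 0.8]
    print(hellinger(doc_a, doc_b))  # small: mostly the same topic
    print(hellinger(doc_a, doc_c))  # larger: different dominant topics
```

Because the distance is bounded between 0 and 1, scores from different pairs of operations can be compared and ranked directly.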
  • 27. Evaluating WSCells
    - To evaluate the use of WSCells with LDA, we:
        - Generate an LDA model for the original <operation> elements, and another for the contextualized WSCells
        - Explore the Global and Local Similarity between each pair of operations in the models
    - Global Similarity: an overall view of the most closely related web service operations in the service set
    - Local Similarity: a per-operation view of the other most related web service operations for each operation
  • 28. Global Similarity
    - We look at Global Similarity using a visualization called Bluevis
    - Bluevis shows the global conceptual structure of a system by highlighting similar operations using an illuminated line from left to right
    - Plot some top fraction of similar operations (top 25,000 in our examples)
    - Use a consistently ordered list of web service operations for the LDA model to view the differences
    - If a display is noisy, it is often an indication that the model is not identifying meaningful data
  • 29. Global Similarity
  • 30. Global Similarity
    - For original raw operations:
        - Bluevis highlights the operations that LDA rates as most similar
        - Some clear structure
        - However, most of this is due to shared keywords, like get and SOAP
        - This uncontextualized model has very little value
  • 31. Global Similarity
  • 32. Global Similarity
    - For contextualized WSCells:
        - A clearer semantic structure, less noise overall
        - Operation similarity becomes meaningful
        - Services with semantic similarity discovered
        - E.g., operations with similar parameters or faults, such as those that manipulate holiday dates or financial rates
  • 33. Local Similarity
    - We can also examine the local similarity for each individual operation
    - Identify the complete ordered list of similarity scores for an operation in the data set
    - Using the top similarity scores, evaluate how meaningful the data is from a user's perspective
    - For example, how can I find the most similar web service operations to the one I am using now?
    - We use a tool called POCO (Pairwise Observation of Concepts) to examine the most similar operations
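The per-operation ranking described on this slide can be sketched as a nearest-neighbour lookup over topic distributions. This is only an illustration of the idea, not the POCO tool: the most_similar function is hypothetical, and the topic probabilities attached to the operation names are invented for the example.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def most_similar(name, topics, k=3):
    """Return the k operations closest to `name`, nearest first."""
    target = topics[name]
    scored = [
        (hellinger(target, dist), other)
        for other, dist in topics.items()
        if other != name
    ]
    return [other for _, other in sorted(scored)[:k]]


if __name__ == "__main__":
    # Invented topic distributions, one per operation (assumption).
    topics = {
        "GetWeatherReport": [0.8, 0.1, 0.1],
        "GetWeather": [0.7, 0.2, 0.1],
        "GetCarriers": [0.2, 0.7, 0.1],
        "GetIndices": [0.1, 0.1, 0.8],
    }
    print(most_similar("GetWeatherReport", topics, k=1))
```

Sorting the full list of distances, rather than keeping only the single best match, gives the complete ordered list of scores that the slide mentions.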
  • 34. Local Similarity
  • 35. Local Similarity
        Operation                   Most similar WSCell               Most similar original raw WSDL operation
        ListFinancials              GetFinancialServicesFromList      LanguagesList
        ExportShipsAndCategories    ExportIteneraryAndSteps           Search
        GetIssueData                GetFlightData                     word_cloud
        GetWeatherReport            GetWeather                        GetIndices
        GetAIDIBOR                  GetTRLIBOR                        GetCarriers
        searchByIdentifier          searchByNameAndAddress            GetLastSecurityHeadlines
        ToolsAndHardwareBox         KitchenAndHousewareBox            ListRenditions
        GetReservations             GetRoomAvailabilityForDay         GetSOFIBOR
        GetOtherProductInfo         NextOtherProductPortion           GetParkingInfo
        GetAllSplitsByExchange      GetAllCashDividendsByExchange     GetTeamLoyalties2
  • 36. Summary
    - Very-high-level domain-specific languages such as WSDL make poor targets for similarity analysis using clone detection and topic models
        - Lack of local context prevents meaningful results
    - Contextualizing using WSCells exposes both cloning and semantic relationships between web operations
        - Clone detection of WSCells identifies similar web service operations
        - Topic models of WSCells expose both global system-wide semantic relationships and local individual relationships between operations
  • 37. Current & Future
    - Continue analysis of web services for the Personal Web using our results
    - Apply contextualization to similarity analysis of other modeling and specification languages (currently Simulink, Stateflow and UML sequence diagrams)
    - Experiment with the effect of contextualization on clone and topic model analysis of traditional languages such as Java and C ("contextual clones")
  • 38. When is a Clone not a Clone? (and vice-versa): Contextualized Analysis of Web Services
    Douglas Martin, Scott Grant, James R. Cordy, David B. Skillicorn
    Questions?