130919   jim cordy - when is a clone not a clone
Software clone, detection, empirical studies, validation

Presentation Transcript

  • When is a Clone not a Clone? (and vice-versa): Contextualized Analysis of Web Services. Douglas Martin, Scott Grant, James R. Cordy, David B. Skillicorn. School of Computing, Kingston, Canada
  • Motivation —  The Personal Web —  Rapidly growing number of web services makes it increasingly difficult to find and choose the right ones —  Need a quick and convenient way to find alternatives —  Hand tagging impractical – automation is needed!
  • Motivation —  Automation —  Similarity detection techniques offer solutions! —  Code clone detection from software engineering research can find similar code fragments – why not similar services? —  Topic models from data mining research can find text documents with similar semantics – why not similar services?
  • Web Service Similarity —  Web services are stored in service registries, containing WSDL service description files —  Could apply clone detection to entire service descriptions —  But what we really want are similar service operations
  • Let’s try it!
    <operation name="GetStock">
      <input message="tns:GetStockRequest"/>
      <output message="tns:GetStockResponse"/>
    </operation>
    <complexType name="Stock">
      <sequence>
        <element name="Supplier" type="xsd:string"/>
        <element name="Warehouse" type="xsd:string"/>
        <element name="OnHand" type="xsd:string"/>
        <element name="OnOrder" type="xsd:string"/>
        <element name="Demand" type="xsd:string"/>
      </sequence>
    </complexType>

    <operation name="GetStock">
      <input message="tns:GetStockRequest"/>
      <output message="tns:GetStockResponse"/>
    </operation>
    <complexType name="Stock">
      <sequence>
        <element name="date" type="xsd:string"/>
        <element name="open" type="xsd:float"/>
        <element name="high" type="xsd:float"/>
        <element name="low" type="xsd:float"/>
        <element name="close" type="xsd:float"/>
        <element name="volume" type="xsd:float"/>
      </sequence>
    </complexType>
  • How about these?
    <operation name="DrawRateChartCustom">
      <input message="DrawRateChartCustomIn"/>
      <output message="DrawRateChartCustomOut"/>
    </operation>

    <operation name="GetTopicBinaryChartCustom">
      <input message="GetTopicBinaryChartCustomSoapIn"/>
      <output message="GetTopicBinaryChartCustomSoapOut"/>
    </operation>
  • So what went wrong? —  At this point we thought maybe our idea wasn’t going to work —  Maybe clone detection can’t help with web service discovery? —  But why? What’s so special about WSDL?
  • Web Service Description Language (WSDL) —  A WSDL service description has 3 main parts: —  a <portType> element where the operations are declared; —  <message> elements corresponding to inputs, outputs and faults of the operations; —  and a <types> element containing an XML Schema that defines the data and structure types used in the messages
  • Web Service Description Language (WSDL) —  This simple example service has two operations: —  ReserveRoom —  GetAvailableRooms
  • Web Service Description Language (WSDL) —  WSDL service description files contain descriptions of the operations that a web service has to offer —  But the pieces of each operation’s own description are scattered over different parts of the WSDL file —  Difficult to identify complete units to analyze and compare
  • The Problem —  This poses a problem for analysis techniques: —  Operations cannot easily be compared for similarity using clone detectors, because there are no contiguous fragments to compare —  And they cannot be analyzed using data mining topic models, because there are no separate complete documents to generate a model from
  • Our Solution —  Our solution is to contextualize the original <operation> elements, to create self-contained operation descriptions —  We use source transformation to inline remote information from the context into the elements that reference or depend on them —  We call these contextualized WSDL operations Web Service Cells, or WSCells —  The first example of a new kind of clone detection: contextual clones
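The inlining idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' source transformation: the function names `contextualize` and `local` are hypothetical, the input is assumed to be WSDL 1.1 with an XML Schema in `<types>`, and only one level of references is inlined (a real contextualizer would presumably follow nested type references as well). Elements and types that share a name are not distinguished here.

```python
import copy
import xml.etree.ElementTree as ET

# Namespace URIs for WSDL 1.1 and XML Schema.
WSDL = "http://schemas.xmlsoap.org/wsdl/"
XSD = "http://www.w3.org/2001/XMLSchema"


def local(qname):
    """Strip a namespace prefix such as 'tns:' from a reference."""
    return qname.split(":")[-1]


def contextualize(wsdl_text):
    """Return a self-contained copy of each <operation>, with the referenced
    <message> parts and their schema definitions inlined (hypothetical helper)."""
    root = ET.fromstring(wsdl_text)
    messages = {m.get("name"): m for m in root.iter(f"{{{WSDL}}}message")}

    # Index the named top-level schema definitions (elements and types).
    definitions = {}
    for schema in root.iter(f"{{{XSD}}}schema"):
        for definition in schema:
            if definition.get("name"):
                definitions[definition.get("name")] = definition

    cells = []
    for port_type in root.iter(f"{{{WSDL}}}portType"):
        for op in port_type.findall(f"{{{WSDL}}}operation"):
            cell = copy.deepcopy(op)  # leave the original document untouched
            for io in cell:  # <input>, <output> and <fault> children
                msg = messages.get(local(io.get("message", "")))
                if msg is None:
                    continue
                for part in msg:
                    part_copy = copy.deepcopy(part)
                    ref = part.get("element") or part.get("type") or ""
                    definition = definitions.get(local(ref))
                    if definition is not None:
                        part_copy.append(copy.deepcopy(definition))
                    io.append(part_copy)
            cells.append(cell)
    return cells
```

Each returned element can be serialized with `ET.tostring(cell, encoding="unicode")` to obtain one contiguous fragment per operation, which is the shape a clone detector or topic model needs.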
  • Contextualizing WSDL Operations
  • Contextual Clone Detection
  • An Experiment —  We ran an experiment to investigate the difference between clone detection on WSCells and on the original raw operations —  Two sets of WSDL service description files: 1,100 operations and 7,500 operations —  Compared NICAD clone detector results for each set at various near-miss difference thresholds (0% = exact clone, 10% = 1 line in 10 different, and so on)
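To make the difference threshold concrete, here is a minimal sketch of a line-based near-miss comparison. It is not NICAD's implementation: it uses Python's `difflib.SequenceMatcher` as a stand-in for the line matching, and the names `line_difference` and `is_clone_pair` are hypothetical.

```python
from difflib import SequenceMatcher


def line_difference(fragment_a, fragment_b):
    """Fraction of lines that differ between two fragments (0.0 = identical)."""
    a = [ln.strip() for ln in fragment_a.splitlines() if ln.strip()]
    b = [ln.strip() for ln in fragment_b.splitlines() if ln.strip()]
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Lines of the longer fragment that have no counterpart in the other one.
    return (longest - matched) / longest


def is_clone_pair(fragment_a, fragment_b, threshold):
    """True if the fragments differ by no more than the given fraction of lines."""
    return line_difference(fragment_a, fragment_b) <= threshold
```

With this definition, two ten-line fragments that differ in one line are a clone pair at a threshold of 0.1 but not at 0.0.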
  • An Experiment —  Number of clones decreases with WSCells
    Clone pairs by difference threshold:
    Difference   Set 1       Set 1     Set 2       Set 2
    Threshold    Originals   WSCells   Originals   WSCells
    0.0          852         705       1434        1066
    0.1          852         734       1434        1228
    0.2          879         775       1438        1637
    0.3          884         813       1469        1637
    —  Reduction in false positives: e.g., the two GetStock operations shown earlier, which are identical as raw operations but have different Stock types
  • An Experiment —  Number of clone classes can increase with WSCells
    Clone classes by difference threshold:
    Difference   Set 1       Set 1     Set 2       Set 2
    Threshold    Originals   WSCells   Originals   WSCells
    0.0          169         187       587         433
    0.1          169         139       587         499
    0.2          172         142       589         631
    0.3          171         136       591         631
    —  Splits by deeper differences – more precision (same GetStock example as shown earlier)
  • Clone Detection for Web Services —  Contextual clone detection with WSCells works! —  Not only finds similar web service operations, but uncovers similar operations we could not find in any other way
    <operation name="DrawRateChartCustom">
      <input message="DrawRateChartCustomIn"/>
      <output message="DrawRateChartCustomOut"/>
    </operation>
    <operation name="GetRealChartCustom">
      <input message="GetRealChartCustomSoapIn"/>
      <output message="GetRealChartCustomSoapOut"/>
    </operation>
    <operation name="GetLastSaleChartCustom">
      <input message="GetLastSaleChartCustomSoapIn"/>
      <output message="GetLastSaleChartCustomSoapOut"/>
    </operation>
    <operation name="DrawYieldCurveCustom">
      <input message="DrawYieldCurveCustomIn"/>
      <output message="DrawYieldCurveCustomOut"/>
    </operation>
    <operation name="GetTopicChartCustom">
      <input message="GetTopicChartCustomSoapIn"/>
      <output message="GetTopicChartCustomSoapOut"/>
    </operation>
    <operation name="GetTopicBinaryChartCustom">
      <input message="GetTopicBinaryChartCustomSoapIn"/>
      <output message="GetTopicBinaryChartCustomSoapOut"/>
    </operation>
  • Semantic Analysis of Web Services —  Contextualized WSCells also make it possible to use data mining topic models to do semantic analysis of web services —  Because they provide self-contained documents of significant size —  Might topic models provide a different view of web service similarity?
  • Latent Dirichlet Allocation —  Latent Dirichlet Allocation (LDA) : —  A statistical model to uncover latent topics —  Identifies the correlation between documents in terms of shared latent topics (sets of tokens) —  Accepts a set of documents (e.g., source files) as input, returns probability distributions over inferred topics (a topic model) as output —  Each document has some probability of being related to topic 1, another probability for topic 2, and so on —  Similar documents should be related to similar topics
  • Latent Dirichlet Allocation —  Documents are represented in the model in terms of probability distributions over topics —  Similarity between documents is found using the Hellinger Distance —  A measure of how much agreement there is between the shared topics of two documents —  Almost identical documents have a small Hellinger Distance since they will be related to the same topics —  In terms of web services, small Hellinger Distances indicate highly related operations
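For reference, the Hellinger distance between two discrete distributions P and Q is the Euclidean distance between their element-wise square roots, divided by the square root of 2, so it ranges from 0 (identical) to 1 (no shared topics). A minimal sketch, assuming each document's topic distribution is given as a list of probabilities in the same topic order (the function name `hellinger` is ours, not taken from the slides):

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both arguments are sequences of probabilities over the same topics,
    in the same order. The result lies between 0.0 and 1.0.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)
```

Two operations whose topic distributions are nearly the same get a distance close to 0, while operations that share no topics at all get a distance of 1.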
  • Evaluating WSCells —  To evaluate the use of WSCells with LDA, we: —  Generate an LDA model for the original <operation> elements, and another for the contextualized WSCells —  Explore the Global and Local Similarity between each pair of operations in the models —  Global Similarity: an overall view of the most closely related web service operations in the service set —  Local Similarity: a per-operation view of the other most related web service operations for each operation
  • Global Similarity —  We look at Global Similarity using a visualization called Bluevis —  Bluevis shows the global conceptual structure of a system by highlighting similar operations using an illuminated line from left-to-right —  Plot some top fraction of similar operations (top 25,000 in our examples) —  Use a consistently ordered list of web service operations for the LDA model to view the differences —  If a display is noisy, it is often an indication that the model is not identifying meaningful data
  • Global Similarity
  • Global Similarity —  For original raw operations: —  Bluevis highlights the operations that LDA rates most similar —  Some clear structure —  However, most of this is due to shared keywords, like get and SOAP —  This uncontextualized model has very little value
  • Global Similarity
  • Global Similarity —  For contextualized WSCells: —  A clearer semantic structure, less noise overall —  Operation similarity becomes meaningful —  Services with semantic similarity discovered —  E.g., Operations with similar parameters or faults, such as those that manipulate holiday dates or financial rates
  • Local Similarity —  We can also examine the local similarity for each individual operation —  Identify the complete ordered list of similarity scores for an operation in the data set —  Using the top similarity scores, evaluate how meaningful the data is from a user's perspective —  For example, how can I find the most similar web service operations to the one I am using now? —  We use a tool called POCO (Pairwise Observation of Concepts) to examine the most similar operations
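The per-operation ranking described above can be sketched as follows. This is an illustration only, not the POCO tool: the function name `most_similar` and the `topic_mix` mapping (operation name to topic distribution) are assumptions, and the Hellinger distance is recomputed inline so the block stands alone.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two topic distributions of equal length."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def most_similar(name, topic_mix, top_n=3):
    """Return the top_n operations closest to `name`, nearest first.

    topic_mix maps each operation name to its topic distribution
    (a hypothetical structure chosen for this sketch).
    """
    target = topic_mix[name]
    scored = [
        (hellinger(target, dist), other)
        for other, dist in topic_mix.items()
        if other != name
    ]
    scored.sort()  # smallest distance means most similar
    return [(other, distance) for distance, other in scored[:top_n]]
```

Calling `most_similar("GetWeatherReport", topic_mix, top_n=1)` on a model in which GetWeatherReport and GetWeather load on the same topics would return GetWeather first, which is the kind of answer a user looking for an alternative service wants.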
  • Local Similarity
  • Local Similarity
    Operation                  Most similar WSCell             Most similar original raw WSDL operation
    ListFinancials             GetFinancialServicesFromList    LanguagesList
    ExportShipsAndCategories   ExportIteneraryAndSteps         Search
    GetIssueData               GetFlightData                   word_cloud
    GetWeatherReport           GetWeather                      GetIndices
    GetAIDIBOR                 GetTRLIBOR                      GetCarriers
    searchByIdentifier         searchByNameAndAddress          GetLastSecurityHeadlines
    ToolsAndHardwareBox        KitchenAndHousewareBox          ListRenditions
    GetReservations            GetRoomAvailabilityForDay       GetSOFIBOR
    GetOtherProductInfo        NextOtherProductPortion         GetParkingInfo
    GetAllSplitsByExchange     GetAllCashDividendsByExchange   GetTeamLoyalties2
  • Summary —  Very-high-level domain-specific languages such as WSDL make poor targets for similarity analysis using clone detection and topic models —  Lack of local context prevents meaningful results —  Contextualizing using WSCells exposes both cloning and semantic relationships between web operations —  Clone detection of WSCells identifies similar web service operations —  Topic models of WSCells expose both global system-wide semantic relationships and local individual relationships between operations
  • Current & Future —  Continue analysis of web services for the Personal Web using our results —  Apply contextualization to similarity analysis of other modeling and specification languages (currently Simulink, Stateflow and UML sequence diagrams) —  Experiment with effect of contextualization on clone and topic model analysis of traditional languages such as Java and C (“contextual clones”)
  • When is a Clone not a Clone? (and vice-versa): Contextualized Analysis of Web Services. Douglas Martin, Scott Grant, James R. Cordy, David B. Skillicorn. Questions?