
Session 2.5: Semantic Similarity based Clustering of License Excerpts for Improved End-User Interpretation

Talk at SEMANTiCS 2017
www.semantics.cc

Published in: Technology

  1. Semantic Similarity based Clustering of License Excerpts for Improved End-User Interpretation
     Najmeh Mousavi Nejad, Simon Scerri & Sören Auer
     SEMANTiCS17 - 13th International Conference on Semantic Systems
     Amsterdam, September 11-14, 2017
  2. Motivation
     • Online research commissioned by Skandia [1]:
       • Only 7% read online End-User License Agreements (EULAs) when signing up for products & services
       • 21% suffered as a result of ticking the EULA box without reading the agreement
       • 10% were locked into a longer-term contract than they expected
       • 5% lost money by not being able to cancel or amend hotels or holidays
     [1] http://www.prnewswire.co.uk/news-releases/skandia-takes-the-terminal-out-of-terms-and-conditions-145280565.html
  3. Problem Statement
     • Given an EULA (End-User License Agreement) in natural language, we want to provide a user-friendly summary of permissions, prohibitions & duties.
     • Focus of this work: clustering similar extracted excerpts of EULAs for the benefit of the end user
     Example excerpts:
     • "You may reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form."
     • "You may reproduce and distribute copies of the Work or Derivative Works with or without modifications and You may add Your own attribution notices within Derivative Works"
  4. Related Work
     • Manual: online service (tldrlegal.com)
     • Semi-automatic:
       • NLL2RDF (Cabrio, et al., ESWC 2014)
         • First attempt to generate RDF expressions of EULAs
         • Exploits CC REL & ODRL vocabularies & supervised machine learning
         • Limitation: covers only a small number of rights
       • EULAide (Mousavi Nejad, et al., SEMANTiCS 2016)
         • Exploits ontology-based information extraction to extract permissions, prohibitions & duties from EULAs
         • Taken as a basis in this work
         • Limitation: a lot of extracted excerpts for long EULAs
  5. Approach
     • Building a similarity matrix for each class, considering different features of the extracted segments
       • Features for each class (permissions, prohibitions & duties) are extracted with JAPE rules
       • Features: Action, Condition, PolicyType
     • Exploiting a distributional semantic approach for computing short-text similarity
       • DISCO: DIstributionally related words using CO-occurrences
       • Has a word space builder: creates the word embedding database over a text corpus
       • Computes semantic similarity based on the word vectors
  6. Approach – Continued
     • Using a hierarchical clustering algorithm to cluster each class (permissions, prohibitions & duties)
     [Figure: example dendrogram over items a-g, with merges such as {c, d}, {e, f} and {a, c, d}]
  7. Architecture
     [Figure: architecture diagram with the following labelled components]
     • User: submits an EULA file or URL; receives permission, duty & prohibition clusters
     • Front-end: web interface (HTML, CSS, AngularJS)
     • Front-end to back-end exchange: FormData/String in, XML/JSON data out
     • Back-end: application server
       • Ontology file (Permission, Duty, Prohibition)
       • Customized GATE OBIE pipeline (feature extraction phase added), producing annotations with features
       • Stanford Dependency Parser, producing annotations with extracted objects
       • DISCO API: 3 similarity matrices for the 3 classes
       • Clustering algorithm: 3 classes, each clustered based on their similarities
  8. Feature Extraction
     • Three different features are extracted with JAPE rules:
       • Sequence of actions: copy, share, distribute
       • Condition on which a specific action is granted, forbidden or obliged
       • PolicyType: copyright, patent or intellectual property right
     Annotated examples:
     • "If you join a Dropbox for business account, you must use it in compliance with your employer's terms & policies." (annotated features, in order: condition, action)
     • "Each contributor grants you a patent license to make, use, sell, import and transfer the work." (annotated features, in order: policy type, action)
  9. Sketch of Semantic Clustering Algorithm
     Input: permissions, prohibitions, duties with features
     for all three classes do
         for all segment pairs in each class do
             A = similarity between actions
             B = similarity between conditions
             C = similarity between policy types
             D = similarity between the remainders of segments
             finalSim = A + B + C + D
             add finalSim to the corresponding matrix cell
         end for
         do HAC clustering for the matrix with a threshold
     end for
     Output: clustered permissions, prohibitions & duties
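The sketch on this slide can be made concrete with a small, self-contained Python example. It is only an illustration under stated assumptions: `token_similarity` is a Jaccard token overlap standing in for the DISCO similarity, the segment dictionaries and the threshold value are hypothetical, and average linkage is used for the HAC step as described on the backup slide.

```python
from itertools import combinations

# Hypothetical feature names; the slides call them Action, Condition, PolicyType
# plus the remainder of the segment.
FEATURES = ("action", "condition", "policy_type", "remainder")


def token_similarity(text_a, text_b):
    """Stand-in for DISCO (assumption): Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0  # assumption: a missing feature contributes no similarity
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def final_similarity(seg_a, seg_b):
    """finalSim = A + B + C + D, one term per feature."""
    return sum(token_similarity(seg_a[f], seg_b[f]) for f in FEATURES)


def similarity_matrix(segments):
    """Symmetric matrix holding finalSim for every segment pair of one class."""
    n = len(segments)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = final_similarity(segments[i], segments[j])
    return matrix


def hac(matrix, threshold):
    """Average-linkage agglomerative clustering over a similarity matrix.

    Merging stops once the most similar pair of clusters falls below `threshold`.
    """
    clusters = [[i] for i in range(len(matrix))]

    def avg_sim(a, b):
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        if avg_sim(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters


# Toy permission segments (invented for illustration only).
permissions = [
    {"action": "copy distribute", "condition": "", "policy_type": "copyright", "remainder": "the work"},
    {"action": "copy share", "condition": "", "policy_type": "copyright", "remainder": "the work"},
    {"action": "sell", "condition": "with notice", "policy_type": "patent", "remainder": "the software"},
]
print(hac(similarity_matrix(permissions), threshold=1.0))  # [[0, 1], [2]]
```

In the described system the same procedure would run three times, once per class, each with its own matrix.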
  10. EULAide Platform Web Interface
  11. Experiments
     • Evaluation of the clustering approach
       • Does the semantic-based clustering compress information?
       • Does the feature extraction phase improve the result?
     • Usability experiments
       • Does using EULAide need less time and effort for EULA comprehension?
       • Does semi-automatic IE lead to information loss?
       • How easy is using EULAide regarding human perception?
  12. Clustering Approach Evaluation
     • Result of clustering for 4 EULAs
       • Clusters-M (baseline): without features
       • Clusters-Mf (our algorithm): with feature extraction phase

     | Class       | #Instances | #Baseline clusters | #Our approach clusters |
     |-------------|-----------:|-------------------:|-----------------------:|
     | Permission  |         30 |                 18 |                     20 |
     | Duty        |         27 |                 24 |                     20 |
     | Prohibition |         40 |                 32 |                     35 |

     • Does the semantic-based clustering compress information? YES
  13. Clustering Approach Evaluation
     • Compiling a gold standard for EULA clustering is hard
     • Solution: measuring the inter-annotator agreement approximately
       • Five subjects were asked to devise their own clustering criteria as they best deemed fit
       • Computing the agreement between them:
         $\text{Rand Index} = \frac{TP + TN}{TP + FP + FN + TN}$
     • Does the feature extraction phase improve the result?

     RI of clustering result for duties:
     |    | h1  | h2   | h3   | h4   | h5   | M    | Mf   |
     |----|-----|------|------|------|------|------|------|
     | h1 | (1) | 0.71 | 0.89 | 0.91 | 0.71 | 0.93 | 0.97 |
     | h2 | *   | (1)  | 0.61 | 0.63 | 0.95 | 0.7  | 0.68 |
     | h3 | *   | *    | (1)  | 0.92 | 0.63 | 0.86 | 0.86 |
     | h4 | *   | *    | *    | (1)  | 0.65 | 0.85 | 0.88 |
     | h5 | *   | *    | *    | *    | (1)  | 0.7  | 0.67 |
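As a minimal sketch of how the Rand index between two clusterings can be computed, the following Python function counts pairs of items. The label lists are invented toy data, and treating the first clustering as the reference when naming FP and FN is an assumption (the index itself is symmetric).

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index = (TP + TN) / (TP + FP + FN + TN), counted over all item pairs.

    labels_a[i] and labels_b[i] are the cluster ids that two annotators
    (or an annotator and the algorithm) assigned to item i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same items")
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1          # pair grouped together in both clusterings
        elif not same_a and not same_b:
            tn += 1          # pair separated in both clusterings
        elif same_b:
            fp += 1          # together only in the second clustering
        else:
            fn += 1          # together only in the first clustering
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 1.0


# Toy example: five excerpts clustered by two hypothetical annotators.
annotator_1 = [0, 0, 1, 1, 2]
annotator_2 = [0, 0, 1, 2, 2]
print(rand_index(annotator_1, annotator_2))  # 0.8
```

Each cell of the agreement table on this slide corresponds to one such pairwise comparison.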
  14. Clustering Approach Evaluation
     • Accumulated deviation of Mf from Human = 9%
     • Accumulated deviation of M from Human = 10%

     Average Rand index:
     | Class       | Human | M    | Mf   |
     |-------------|-------|------|------|
     | Permission  | 0.79  | 0.83 | 0.81 |
     | Duty        | 0.76  | 0.81 | 0.81 |
     | Prohibition | 0.83  | 0.84 | 0.85 |

     • Does the feature extraction phase improve the result? YES
     • Feature-based approach:
       • Generates more fine-grained clusters
       • Is more attuned to human intuition and perception
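The two accumulated deviations can be reproduced from the average Rand index values on this slide, assuming that "accumulated deviation" means the sum of absolute differences from the Human column over the three classes (the slide does not define it; this reading matches the reported 9% and 10%):

```latex
\begin{align*}
\Delta_{Mf} &= |0.81 - 0.79| + |0.81 - 0.76| + |0.85 - 0.83| = 0.02 + 0.05 + 0.02 = 0.09 \\
\Delta_{M}  &= |0.83 - 0.79| + |0.81 - 0.76| + |0.84 - 0.83| = 0.04 + 0.05 + 0.01 = 0.10
\end{align*}
```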
  15. Usability Evaluation
     • Does using EULAide need less time and effort for EULA comprehension?
     • Does semi-automatic IE lead to information loss?
     • Experimental setup for the 4 EULAs
       • A legal expert designed 5 multiple-choice questions for each EULA
       • 6 students read EULAs in 2 modes: natural text & EULAide
     [Table: assignment of the 6 students (h1-h6) to the 8 reading conditions (EULAs 1-4, each in full-text and EULAide mode); each condition is marked for 3 students]
     • They answered the questions in 2 phases
       • Phase 1: from memory, without looking at the text
       • Phase 2: using search tools to find the answers
  16. Usability Evaluation

     Average percentage of question results (%):
     | Mode      | Correct | Incorrect | Unanswered in phase 1: phase 2 correct | Unanswered in phase 1: phase 2 incorrect | Unanswered in phase 1: phase 2 unanswered |
     |-----------|--------:|----------:|---------------------------------------:|-----------------------------------------:|------------------------------------------:|
     | EULA-full |      67 |         8 |                                   18.5 |                                        5 |                                       1.5 |
     | EULAide   |      62 |        15 |                                    6.5 |                                      4.5 |                                        12 |

     Average time in seconds:
     | Mode      | Reading | Answering phase 1 | Answering phase 2 |
     |-----------|--------:|------------------:|------------------:|
     | EULA-full |    1185 |                75 |               152 |
     | EULAide   |     315 |                72 |                77 |

     • Does using EULAide need less time and effort for EULA comprehension? YES
     • Does semi-automatic IE lead to information loss? YES
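A short worked calculation relates these numbers to the figures quoted in the conclusions. It assumes that the time saving refers to reading time and that the information loss refers to the difference in questions still unanswered after phase 2; the slides do not state how the two headline figures were derived, so this is one consistent reading rather than the authors' stated method.

```latex
\begin{align*}
\text{reading-time saving} &= \frac{1185 - 315}{1185} = \frac{870}{1185} \approx 0.73 \\
\text{total-time saving} &= \frac{(1185 + 75 + 152) - (315 + 72 + 77)}{1185 + 75 + 152} = \frac{948}{1412} \approx 0.67 \\
\text{additional unanswered questions} &= 12\% - 1.5\% = 10.5 \text{ percentage points}
\end{align*}
```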
  17. Usability Test
     • How easy is using EULAide regarding human perception?
     • USE questionnaire: usefulness, satisfaction, ease of learning and ease of use [1]
       • Contains 30 questions
       • 1 = strongly dissatisfied, 7 = strongly satisfied
       • With 6 participants

     | Usefulness | Ease of use | Ease of learning | Satisfaction |
     |-----------:|------------:|-----------------:|-------------:|
     |       6.14 |        6.11 |             6.75 |          6.0 |

     • Recommendations by participants:
       • Extending the idea of a summary in the header of each accordion
       • Including other aspects of EULAs: what is the agreement between us and the service provider?
     [1] http://garyperlman.com/quest/quest.cgi?form=USE
  18. Conclusions & Future Work
     • EULAide is the first comprehensive approach for EULA interpretation
     • Clustering is effective and reduces the number of relevant terms for users to focus on initially
     • EULAide presents EULAs in a visual form that is simple to digest
       • It saves around 75% of the time
       • But it has a marginal price of a 10.5% loss of valuable information
     • Future work
       • Improving the feature extraction phase
       • Extending the summarization technique of each cluster for the benefit of end users
     nejad@cs.uni-bonn.de
  19. THANK YOU!
     nejad@cs.uni-bonn.de
  20. References
     • All images are from Pixabay and are released into the public domain under Creative Commons CC0.
  21. Backup Slide
     • Agglomerative hierarchical clustering (HAC) is an established, well-known technique which has been shown to be a successful method for text and document clustering.
     • Furthermore, among different HAC methods, average linkage has been shown to be the most suitable one for text categorization [1, 24]. Once the proper clustering technique is identified, we can pass the similarity matrices to the clustering component. The HAC process continues until it reaches a predefined threshold.
     • In average-linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
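The average-linkage definition on this slide can be illustrated with a few lines of Python. The one-dimensional points and the absolute-difference distance are toy assumptions chosen so the arithmetic is easy to follow; in the described system the inputs would instead come from the similarity matrices.

```python
def average_linkage(cluster_a, cluster_b, distance):
    """Average of the pairwise distances between every point in cluster_a
    and every point in cluster_b."""
    total = sum(distance(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def abs_distance(p, q):
    """Toy distance between two numbers (assumption for illustration)."""
    return abs(p - q)


left, right = [1.0, 2.0], [4.0, 6.0]
# Pairwise distances: |1-4|=3, |1-6|=5, |2-4|=2, |2-6|=4, so the average is 14/4.
print(average_linkage(left, right, abs_distance))  # 3.5
```

At each HAC step the two clusters with the smallest such distance are merged, until the predefined threshold is reached.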
  22. Backup Slide
     • DISCO computation method:
       $\mathrm{Sim}(T_1, T_2) = \frac{\mathrm{directedSim}(T_1, T_2) + \mathrm{directedSim}(T_2, T_1)}{2}$
       $\mathrm{directedSim}(T_1, T_2) = \sum_{i=1}^{n} \left[ \mathrm{weight}(w_{i1}) \cdot \max_{1 \le j \le n} \mathrm{WordSim}(w_{i1}, w_{j2}) \right]$
     • 5 participants could reveal about 80% of all usability problems that exist in a product (Nielsen, 1993).
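The DISCO text-similarity formula on this slide can be sketched in Python as follows. This does not call the DISCO API: `word_sim` is a hypothetical lookup table with invented scores standing in for the word-vector similarity, and uniform word weights that sum to 1 are an assumption, since the slide does not say how `weight` is defined.

```python
# Invented word-pair similarities, standing in for DISCO's WordSim (assumption).
TOY_WORD_SIM = {
    frozenset(("copy", "reproduce")): 0.8,
    frozenset(("share", "distribute")): 0.6,
}


def word_sim(word_a, word_b):
    """Toy WordSim: 1.0 for identical words, table lookup otherwise."""
    if word_a == word_b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((word_a, word_b)), 0.0)


def directed_sim(text_1, text_2):
    """For each word of text_1, take its best match in text_2 and sum the
    weighted scores (uniform weights assumed)."""
    if not text_1 or not text_2:
        return 0.0
    weight = 1.0 / len(text_1)
    return sum(weight * max(word_sim(w1, w2) for w2 in text_2) for w1 in text_1)


def sim(text_1, text_2):
    """Symmetric similarity: mean of the two directed similarities."""
    return (directed_sim(text_1, text_2) + directed_sim(text_2, text_1)) / 2


actions_1 = ["copy", "share"]
actions_2 = ["reproduce", "distribute", "sell"]
print(round(directed_sim(actions_1, actions_2), 3))  # 0.7
print(round(directed_sim(actions_2, actions_1), 3))  # 0.467
print(round(sim(actions_1, actions_2), 3))           # 0.583
```

Averaging the two directions makes the score symmetric, which is what a similarity matrix for clustering requires.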
