Analyzing Reviews and Code of Mobile Apps for Better Release Planning

The mobile application industry is growing at an unprecedented rate, and developers working in this context face fierce competition in acquiring and retaining users.
They have to quickly implement new features and fix bugs, or risk losing their users to the competition. To achieve this goal they must closely monitor and analyze the user feedback they receive in the form of reviews. However, successful apps can receive up to several thousand reviews per day, and manually analyzing each of them is a time-consuming task. To help developers deal with the large amount of available data, we manually analyzed the text of 1566 user reviews and defined a high- and low-level taxonomy containing mobile-specific categories (e.g. performance, resources, battery, memory) that are highly relevant for developers when planning maintenance and evolution activities. We then built the User Request Referencer (URR) prototype, which uses Machine Learning and Information Retrieval techniques to automatically classify reviews according to our taxonomy and to recommend, for a particular review, the source code files that need to be modified to handle the issue it describes. We evaluated our approach through an empirical study involving the reviews and code of 39 mobile applications. Our results show that URR achieves high precision and recall in organising reviews according to the defined taxonomy.


  1. Analyzing Reviews and Code of Mobile Apps for Better Release Planning. Adelina Ciurumelea, Andreas Schaufenbühl, Sebastiano Panichella, Harald C. Gall (software evolution & architecture lab)
  2. Extremely Popular Apps: 8,087,067 reviews; 3,505,905 reviews; 38,742,600 reviews
  3. Open Source Apps: 62,707 reviews
  4. The number of reviews is large compared to the available development resources.
  5. Importance of reviews
     • reviews contain valuable feedback directly from the users
     • users often report bugs, comment on the user experience and request features
     • the review content influences the number of downloads
  6. INFORMATIVE vs. NON-INFORMATIVE: “AR-Miner: Mining informative reviews for developers from mobile app marketplace”, N. Chen, J. Lin, S. Hoi, X. Xiao, and B. Zhang
  7. BUG, FEATURE REQUEST, OTHER: “Release planning of mobile apps based on user reviews”, L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta
  8. BUG, FEATURE REQUEST
     • the developer has to manually analyse the unstructured groups of reviews, understand what they talk about and extract actionable change tasks
     • what does a particular cluster talk about? Is it about the UI, about the performance of the app, or about something else?
  9. What are the mobile-specific topics users talk about in their reviews?
  10. Manual analysis of ~1600 reviews
  11. Hmmm... Mm No… This is IT Nope Nopity nope
     • not all reviews are useful
  12. Hmmm... Mm No… This is IT Nope Nopity nope Sucks Way to many errors 0 stars Garbage. problem bro Garbage Bla bla bla
     • not all reviews are useful
     • some are even offensive
  13. “Pretty close to perfect, this app is way better than any comic book reader I've ever used. It's small, it operates fast, and the interface is incredibly clean and simple.”
     • others can provide valuable information for the developer
  14. “Pretty close to perfect, this app is way better than any comic book reader I've ever used. It's small, it operates fast, and the interface is incredibly clean and simple.”
     Labels: Resources, Usage
  15. “For info (in case dev not already aware!), there is a graphical glitch when scrolling output in marshmallow on a nexus 5.”
     Labels: Compatibility, Usage, Complaint
  16. Building the taxonomy
     Content analysis in 2 passes:
     • start with an empty list of categories
     • analyse each review and add a new category (including definition and keywords) if necessary
     • label the review with all the matching categories
     • second pass: revisit the list of reviews and label them with the appropriate categories
  17. High Level Taxonomy
     Category       Description
     Compatibility  mentions the OS, mobile device or a specific hardware component.
     Usage          talks about the UI or the usability of the app.
     Resources      mentions the app’s influence on the battery and memory usage or the performance of the app/phone.
     Pricing        statements mentioning the license model or the price of the app.
     Protection     statements referring to security or privacy issues.
     Complaint      the user reports or complains about an issue with the app.
  18. Specialise the taxonomy further
  19. “Liked it and worked very well in lollipop, but not MM The plugins don't refresh, manual navigation to next image doesn't work. Some plugins give error. Altogether seems broken after MM update on Note 4.”
     Labels: Compatibility
  20. “Liked it and worked very well in lollipop, but not MM The plugins don't refresh, manual navigation to next image doesn't work. Some plugins give error. Altogether seems broken after MM update on Note 4.”
     Labels: Compatibility (Device, Android Version)
  21. Low Level Taxonomy
     High Level     Low Level Categories
     Compatibility  Device, Android Version, Hardware
     Usage          App Usability, UI
     Resources      Performance, Battery, Memory
     Pricing        Licensing, Price
     Protection     Security, Privacy
  22. Automated Classification
  23. ML Approach: Preprocessing & Feature Extraction, Training (Gradient Boosted Trees), Multi-label Classification
  24. Preprocessing & Feature Extraction
     • preprocessing: stop word removal and stemming
     • feature extraction: TF-IDF scores and 2- and 3-gram counts
  25. Training
     • one-vs-all strategy: separate classifier for each high and low level category (18 in total)
     • used the Gradient Boosted Trees model
  26. Multi-label Classification: Preprocessing, Feature Extraction, Classification, High & Low Level Categories (Battery, UI, Complaint, Resources, Usage, …)
  27. Example
     • 752 user reviews from our dataset belong to AcDisplay
     • analyse Compatibility and Complaint reviews (61 reviews)
     • Complaint and Android Version (22 reviews)
  28. Example
     • “Good but has some issues with Marshmallow I used this on my old phone and if was flawless and I loved it. I noticed that sometimes when I had AcDisplay activated I would not be able to use the fingerprint sensor even after I unlocked AcDisplay and had to enter a password. This is very frustrating so I cannot use AcDisplay.”
     • “Love the design I love the app. It’s super sleek and nice. But ever since my phone updated to marshmallow it’s stopped working. Hope it comes back soon.”
     • “On Marshmallow, the screen is buggy and sometimes shows the notification shade.”
  29. Source Code Localisation
     • can we link reviews to the related source code?
     • IR methods based on the VSM (hard task: the vocabulary used by reviews and source code is different)
     • use additional Android project specific information (e.g. UI functionality is implemented in Activity classes)
  30. Source Code Localisation (diagram): User Reviews, App’s Source Code, Android Project Structure Info, IR - VSM, Software Artifacts
  31. Evaluation
     RQ1: To what extent does our approach organise reviews according to meaningful maintenance and evolution tasks for developers?
     RQ2: Does our approach correctly recommend the software artifacts that need to be modified in order to handle user requests and complaints?
  32. Reviews, Source Code
  33. Study RQ1
     • ~7800 user reviews from 39 apps
  34. Study RQ1
     • 2 external evaluators
     • evaluate 200 reviews for each category (3600 total)
  35. Results RQ1
     High Level Category  Precision  Recall  F1 Score
     Compatibility        71%        97%     82%
     Usage                89%        94%     91%
     Resources            79%        99%     88%
     Pricing              85%        97%     90%
     Protection           89%        98%     93%
     Complaint            85%        80%     82%
  36. Results RQ1
     High Level Category  Low Level Category  Precision  Recall  F1 Score
     Compatibility        Device              85%        98%     91%
     Compatibility        OS Version          89%        86%     87%
     Compatibility        Hardware            61%        95%     74%
     Usage                App Usability       92%        91%     91%
     Usage                UI                  83%        93%     88%
     Resources            Performance         64%        97%     77%
     Resources            Battery             78%        95%     86%
     Resources            Memory              68%        95%     79%
     Pricing              Licensing           91%        98%     94%
     Pricing              Price               85%        96%     90%
     Protection           Security            87%        98%     92%
     Protection           Privacy             83%        96%     89%
  37. Results RQ1: Our approach is able to classify reviews with high precision and recall according to the mobile-specific topics we derived. The most important categories are Usage, Resources and Compatibility.
  38. Study RQ2
     • 1 external evaluator
     • 91 user reviews from 2 apps
  39. Results RQ2
     Quality of Reviews  Precision  Recall  F1 Score
     Difficult to Link   41%        83%     55%
     Easier to Link      52%        79%     63%
     All                 51%        79%     62%
  40. Results RQ2: Our approach achieves promising results in recommending related software artifacts for specific user reviews. Furthermore, better quality reviews are easier to link than lower quality ones.
  41. Conclusion & Future Work
     • reviews can be classified with high precision and recall using machine learning according to mobile-specific topics
     • linking reviews to source code using textual similarity based methods is difficult
     • future work: summarise reviews, improve localisation (static analysis)
  42. Discussion: What mechanisms can we adopt for enabling a reliable and practical solution for code localisation?
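
The preprocessing and feature extraction step on slide 24 (stop word removal, stemming, TF-IDF scores, 2- and 3-gram counts) could be sketched roughly as below. This is a minimal standard-library illustration, not the URR implementation: the stop-word list, the `naive_stem` suffix stripper (a stand-in for a real stemmer) and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "or", "to", "of", "in", "on", "this", "my", "i"}


def naive_stem(token):
    """Crude suffix stripping; only a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Lowercase, tokenise, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", review.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(corpus_tokens):
    """Return one {term: tf-idf score} dict per tokenised document."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    vectors = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def extract_features(reviews):
    """TF-IDF scores for single terms plus raw counts of 2- and 3-grams."""
    corpus = [preprocess(r) for r in reviews]
    features = []
    for tokens, scores in zip(corpus, tfidf(corpus)):
        row = {f"tfidf:{term}": score for term, score in scores.items()}
        for n in (2, 3):
            for gram, count in Counter(ngrams(tokens, n)).items():
                row[f"{n}gram:{gram}"] = count
        features.append(row)
    return features
```

Each review becomes a sparse dictionary of named features. Terms that occur in fewer reviews get a higher TF-IDF weight than terms shared across many reviews.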

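The one-vs-all strategy on slide 25 trains a separate binary classifier per category. The sketch below shows only how multi-label annotations might be turned into per-category binary targets, using the taxonomy from slides 17 and 21. The function names and the `make_classifier` factory are hypothetical. The slides use Gradient Boosted Trees as the model, whereas this sketch accepts any object with a `fit(X, y)` method.

```python
# High-level categories and their low-level refinements, as listed on slides 17 and 21.
TAXONOMY = {
    "Compatibility": ["Device", "Android Version", "Hardware"],
    "Usage": ["App Usability", "UI"],
    "Resources": ["Performance", "Battery", "Memory"],
    "Pricing": ["Licensing", "Price"],
    "Protection": ["Security", "Privacy"],
    "Complaint": [],
}


def all_categories(taxonomy):
    """Flatten the taxonomy into one list of high- and low-level category names."""
    categories = []
    for high_level, low_levels in taxonomy.items():
        categories.append(high_level)
        categories.extend(low_levels)
    return categories


def one_vs_all_targets(labelled_reviews, categories):
    """labelled_reviews: list of (text, set_of_labels) pairs.

    Returns {category: [1 if the review carries that label else 0, ...]}.
    """
    return {
        category: [1 if category in labels else 0 for _, labels in labelled_reviews]
        for category in categories
    }


def train_one_vs_all(features, targets, make_classifier):
    """Fit one independent binary model per category.

    make_classifier is a hypothetical zero-argument factory returning an
    object with a fit(X, y) method.
    """
    models = {}
    for category, y in targets.items():
        model = make_classifier()
        model.fit(features, y)
        models[category] = model
    return models
```

Flattening the taxonomy yields 6 high-level plus 12 low-level categories, which matches the 18 classifiers mentioned on the slide. A review can be labelled positive by several of them at once, and that is what makes the overall classification multi-label.
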
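
Slide 29 describes linking reviews to source code with IR methods based on the vector space model (VSM). A minimal sketch of that idea follows: the review and each source file become weighted term vectors, and files are ranked by cosine similarity to the review. The identifier splitting, the smoothed IDF weighting and the file names in the usage example are assumptions made for illustration. The Android project structure information mentioned on the slide is not modelled here.

```python
import math
import re
from collections import Counter


def split_identifiers(text):
    """Split prose and camelCase or snake_case identifiers into lowercase words."""
    words = []
    for token in re.findall(r"[A-Za-z]+", text):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)
        words.extend(part.lower() for part in parts)
    return words


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_files(review, source_files):
    """source_files: {path: source text}. Returns [(path, score)], best match first."""
    docs = {path: Counter(split_identifiers(code)) for path, code in source_files.items()}
    query = Counter(split_identifiers(review))
    n_docs = len(docs) + 1  # the review counts as one document
    doc_freq = Counter(term for bag in list(docs.values()) + [query] for term in bag)
    # Smoothed IDF (an assumption) so terms present everywhere keep a small weight.
    idf = {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in doc_freq}

    def weigh(bag):
        return {term: count * idf[term] for term, count in bag.items()}

    weighted_query = weigh(query)
    scored = [(path, cosine(weighted_query, weigh(bag))) for path, bag in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A usage example with two made-up files:

```python
files = {
    "FingerprintActivity.java": "class FingerprintActivity { void unlockWithFingerprint() {} }",
    "NotificationList.java": "class NotificationList { void showNotification() {} }",
}
ranking = rank_files("I cannot use the fingerprint sensor after I unlock", files)
```

The ranking depends entirely on shared vocabulary, which illustrates why the slides call this a hard task: a review that describes a problem in words that never appear in identifiers or comments scores zero against every file.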