
Harnessing the Crowds for Automating the Identification of Web APIs


Supporting the efficient discovery and use of Web APIs is increasingly important as their use and popularity grow. Yet a simple task like finding potentially interesting APIs and their related documentation turns out to be hard and time-consuming, even when using the best resources currently available on the Web. We describe our research towards an automated Web API documentation crawler and search engine. We have devised and exploited crowdsourcing techniques to generate a curated dataset of Web API documentation. Thanks to this dataset, we have built an engine able to automatically detect documentation pages. Our preliminary experiments show an accuracy of about 80% and a precision increase of 15 points over the keyword-based heuristic we used as a baseline.

Published in: Technology, Education

  1. 1. Harnessing the Crowds for Automating the Identification of Web APIs
     Carlos Pedrinaci, Chenghua Lin, Dong Liu, John Domingue
     KMi, The Open University
  2. 2. Web APIs are the new Web services
     • Publicly offering valuable data and functionality
     • Widely used and reused
     • Although their use is hardly automated
  3. 3. Web APIs and RESTful Services
     • Services based on a simple(r) stack of technologies than WS-*
       • Roughly URL + HTTP + XML/JSON
     • Easy way to provide a programmatic interface to existing Web sites
     • Seldom adopt REST principles
  4. 4. How to Discover Web APIs?
  5. 5. Poor results
  6. 6. OK results
  7. 7. Poor results
  8. 8. Out of date
  9. 9. Issues for Discovering Web APIs
     • There is no simple way to effectively and uniquely identify Web APIs
       • No standardised document describing the interface
       • URLs are hardly usable for this end
  10. 10. How can we automatically find Web APIs?
  11. 11. Hypothesis
      • Every Web API provides one or more public documentation pages
      • These pages provide the most relevant information for developers
      ‣ Web API location can be approached as a documentation discovery problem
  12. 12. Web API Identification
      Given a Web page, determine if it documents an API or not
      Sometimes a hard problem even for humans
  13. 13. Collecting documentation pages
      Harnessing the crowds for detecting documentation pages
  14. 14. Generating a curated dataset
      Often the links are obsolete or point to general pages
  15. 15. Dataset Generated
      • We used API Validator to process 1,872 APIs from ProgrammableWeb
        • 43% of the URLs we started with (data from 2010)
        • 624: a documentation page
        • 929: not a documentation page
        • 318: skipped (server down or unclear)
  16. 16. Web API Identification Engine
      • Web API identification as a binary classification problem
      • Extract core features from Web pages
      • Use machine learning algorithms to provide an identification engine
  17. 17. Preliminary Experiment
      • Initially used only the words of the Web page as features
      • Trained two classifiers: Naive Bayes (NB) and SVM
      • Used a simple keyword-based heuristic as baseline for comparison (the occurrence of 3 or more keywords)
        • api, input, output, GET, PUT, etc.
  18. 18. Evaluation Results (all values in %)

      Model     Precision  Recall  F1    Accuracy
      Keyword   60.3       75.7    67.0  70.2
      NB        71.0       79.2    74.8  78.6
      SVM       75.4       70.8    73.1  79.0
  19. 19. Evaluation Results
      • Although preliminary, the approach already provides promising results
      • Both NB and SVM provide good accuracy (about 80%)
      • Best precision (75.4%) achieved by SVM, which is 15 points better than the baseline
  20. 20. Conclusions and Future Work
      • Discovering Web APIs is becoming increasingly important and existing support is not optimal
      • Web API identification is a first step that can well be approached as a documentation identification problem
      • Crowd input (ProgrammableWeb and API Validator) has been essential
  21. 21. Conclusions and Future Work
      • Further features are being included to improve the results
        • Title, URL, presence of camelCase words
        • Current tests have reached an accuracy of 82% using SGD
  22. 22. Conclusions and Future Work
      • A larger training set is necessary
        • Need more validated pages (help!)
      • A larger experiment will be carried out over a normal Web crawl
  23. 23. Thanks for your attention
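
The keyword baseline from slide 17 and the four metrics reported on slide 18 can be sketched together in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the keyword set contains only the five words named on the slide (the slide ends the list with "etc"), matching is assumed to be case-insensitive and to count distinct keywords, and the five sample pages are invented for illustration. The names `looks_like_api_doc`, `evaluate` and `PAGES` are hypothetical.

```python
import re

# Keywords named on slide 17. The slide ends the list with "etc", so this
# set is knowingly incomplete.
KEYWORDS = {"api", "input", "output", "get", "put"}
MIN_KEYWORDS = 3  # "the occurrence of 3 or more keywords"


def looks_like_api_doc(page_text: str) -> bool:
    """Keyword baseline: True if at least MIN_KEYWORDS distinct keywords occur.

    Assumption: matching is case-insensitive on whole words and counts
    distinct keywords; the slides do not specify either detail.
    """
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return len(KEYWORDS & words) >= MIN_KEYWORDS


def evaluate(predicted, actual):
    """Return precision, recall, F1 and accuracy (as fractions) for boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # documentation, flagged
    fp = sum(p and not a for p, a in pairs)      # not documentation, flagged
    fn = sum(not p and a for p, a in pairs)      # documentation, missed
    tn = sum(not p and not a for p, a in pairs)  # not documentation, not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented sample pages: (text, is_documentation_page).
PAGES = [
    ("Use the API: send a GET request with these input parameters.", True),
    ("API reference: PUT updates a record; the output is JSON.", True),
    ("Our REST endpoints return XML.", True),
    ("Get in touch: put your email in the input box for API news.", False),
    ("Company blog: we went hiking last weekend.", False),
]

predictions = [looks_like_api_doc(text) for text, _ in PAGES]
labels = [label for _, label in PAGES]
metrics = evaluate(predictions, labels)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The fourth sample page shows why such a heuristic loses precision: ordinary words like "get", "put" and "input" trigger it on a page that documents nothing. The third page shows the opposite failure, a documentation page that uses none of the listed keywords.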