seekda's Web Service search engine

Presentation at the Semantic Web meetup in Seattle, WA, USA, in March 2012: http://www.meetup.com/Semantically-Webbed-Seattle-Meetup-Group/events/52635992/

Transcript of "seekda's Web Service search engine"

  1. seekda's Web Service Search Engine. Nathalie Steinmetz, seekda GmbH. © Copyright 2012 SEEKDA GmbH – www.seekda.com
  2. seekda Web Service Search Engine
  3. Motivation
     - “Web of services”
       - Growing amount of public services & data on the Web
     - Problem: How do I find the service I need?
       - General search engine: services hard to identify, not much information on results page
       - Specific portals: access to restricted sets of registered and editorially maintained services
     - Use semantic technologies for better search experience
       - No to heavy-weight, expressive semantic web service languages such as OWL-S or WSML
       - Yes to simple light-weight semantic annotations in RDF → Scalability!
  4. Outline
     - Web Service search engine - basics
     - Focused Crawling
     - WSDL-based services
     - Web APIs
     - seekda's search engine & experimental prototype
     - Crowdsourcing Web Service annotations
     - Web Service Annotation wizard
     - Amazon Mechanical Turk crowdsourcing
     - Service ontologies
  5. Service Location
     - Locating Web Services on the Web (approach adopted by European projects Service-Finder & SOA4All)
       - Crawling the Web for services
       - Aggregate information
       - Annotate services
     - Supported services:
       - WSDL descriptions
       - Web APIs (a.k.a. RESTful services)
  6. Service Crawler Architecture (diagram; labels: Crawl Operator, Collecting Seeds, Configuration & Monitoring, Crawling, RDF meta-data, Data Post-Processing, ARCs, Index)
  7. Crawling the Web for Services
     - Basic crawling process:
       - Start with a set of seed URLs
       - Check whether a page should be fetched or not
       - Fetch the document the URL points to
       - Extract links from the fetched document
       - Decide whether or not to store fetched documents
       - Feed crawler queues with newly extracted links
       - Assign costs/priorities to single URLs and queues
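
The basic crawling process on slide 7 could be sketched roughly as follows. This is a minimal illustration, not seekda's crawler: the names `crawl`, `LinkExtractor`, `fetch`, `should_fetch` and `should_store` are hypothetical, the fetch function is injected so the loop can run without network access, and the cost/priority step is left out (a plain FIFO queue is used instead).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, should_fetch, should_store, max_pages=100):
    """FIFO crawl. `fetch(url)` is assumed to return HTML text or None."""
    queue = deque(seeds)           # start with a set of seed URLs
    seen = set(seeds)
    stored = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not should_fetch(url):  # check whether the page should be fetched
            continue
        body = fetch(url)          # fetch the document the URL points to
        fetched += 1
        if body is None:
            continue
        parser = LinkExtractor()
        parser.feed(body)          # extract links from the fetched document
        if should_store(url, body):
            stored[url] = body     # decide whether to store the document
        for href in parser.links:  # feed the queue with newly extracted links
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Passing a dictionary lookup as `fetch` is enough to try the loop on a handful of in-memory pages; a real crawler would add politeness delays, robots.txt handling and error handling.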
  8. Focused Crawling Techniques
     - Seed Collection
       - Collecting seeds from specialized portals
       - Reuse known Web Service descriptions and related documents
     - URL Scheduling
       - Use clever means to prioritize URLs to focus the crawls on the relevant part of the Web
       - Assign costs that influence the priority of a URL in a queue
       - Based on:
         - Building term vectors of pages to assess similarity to WS domain
         - URL characteristics
     - Queue Scheduling
       - One queue per host
       - Prioritize queues with low-cost URLs
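
One way the URL and queue scheduling on slide 8 might look in code is sketched below. Everything specific here is an assumption made for illustration: the reference vocabulary `WS_TERMS`, the weights in `url_cost`, and the `HostQueues` class are hypothetical and not taken from the talk. The sketch only shows the general idea of deriving a cost from term-vector similarity and URL characteristics, keeping one queue per host, and serving the queue with the cheapest URL first.

```python
import heapq
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical reference term vector for the Web Service domain.
WS_TERMS = Counter("wsdl soap service api endpoint operation binding".split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def url_cost(url, source_page_text):
    """Lower cost means higher priority. The weights are illustrative only."""
    similarity = cosine(Counter(source_page_text.lower().split()), WS_TERMS)
    cost = 1.0 - similarity
    if "wsdl" in url.lower():  # URL characteristic hinting at a description
        cost -= 0.5
    return cost


class HostQueues:
    """One priority queue per host; the queue with the cheapest head wins."""

    def __init__(self):
        self.queues = {}  # host -> heap of (cost, url)

    def push(self, url, cost):
        host = urlparse(url).netloc
        heapq.heappush(self.queues.setdefault(host, []), (cost, url))

    def pop(self):
        if not self.queues:
            raise IndexError("pop from empty HostQueues")
        host = min(self.queues, key=lambda h: self.queues[h][0])
        _cost, url = heapq.heappop(self.queues[host])
        if not self.queues[host]:
            del self.queues[host]
        return url
```

Keeping a separate queue per host also makes it straightforward to add per-host politeness limits later, which this sketch does not attempt.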
  9. Identify WSDLs and Related Information
     - WSDL identification
       - Check whether a fetched page is XML and valid WSDL
     - Related documents identification
       - Definition of related document:
         - Inlink to the WSDL
         - Outlink from the WSDL
         - Associated by term vector similarity
       - Task split between crawl run-time and post-processing of the crawl data
       - Task implies the deeper crawling of service provider domains
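
A rough version of the WSDL identification check on slide 9 is sketched below. It only tests that the document parses as XML and that its root element is the WSDL 1.1 `definitions` or WSDL 2.0 `description` element; it does not validate against the WSDL schema, so it is weaker than the "valid WSDL" check the slide mentions. The function name `looks_like_wsdl` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Root elements of WSDL documents, in ElementTree's {namespace}tag notation.
WSDL_ROOTS = {
    "{http://schemas.xmlsoap.org/wsdl/}definitions",  # WSDL 1.1
    "{http://www.w3.org/ns/wsdl}description",         # WSDL 2.0
}


def looks_like_wsdl(document):
    """Return True if `document` is well-formed XML with a WSDL root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False  # not XML at all
    return root.tag in WSDL_ROOTS
```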
  10. Unique Service Objects
      - Building unique service objects
        - Collect all similar WSDLs → deduplication
        - One service = all WSDLs with same provider and service
      - Example:
        - Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
        - Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
        - Provider: cdyne.com
        - Service: IP2Geo
        - WSDLs:
          - http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
          - http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
          - ...
      - Create unique service identifiers:
        - http://seekda.com/providers/<providerName>/<serviceName>
      - Assemble related information
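
The deduplication on slide 10 could be sketched as a grouping step, shown below. The input format (tuples of WSDL URL, endpoint URL and service name, assumed to have been extracted from each WSDL beforehand) and the helper names are hypothetical. Deriving the provider from the last two labels of the endpoint host is a naive assumption that happens to fit the cdyne.com example; it would not work for hosts such as `example.co.uk`.

```python
from collections import defaultdict
from urllib.parse import urlparse


def provider_of(endpoint_url):
    """Naive provider name: last two labels of the endpoint host (assumption)."""
    host = urlparse(endpoint_url).hostname or ""
    return ".".join(host.split(".")[-2:])


def build_service_objects(wsdl_records):
    """Group (wsdl_url, endpoint_url, service_name) records into unique services."""
    services = defaultdict(lambda: {"endpoints": set(), "wsdls": []})
    for wsdl_url, endpoint_url, service_name in wsdl_records:
        provider = provider_of(endpoint_url)
        service_id = f"http://seekda.com/providers/{provider}/{service_name}"
        services[service_id]["endpoints"].add(endpoint_url)
        services[service_id]["wsdls"].append(wsdl_url)
    return dict(services)


# The two WSDL copies from the slide; that both declare the same endpoint
# is assumed here for illustration.
records = [
    ("http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl",
     "http://ws.cdyne.com/ip2geo/ip2geo.asmx", "IP2Geo"),
    ("http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl",
     "http://ws.cdyne.com/ip2geo/ip2geo.asmx", "IP2Geo"),
]
services = build_service_objects(records)
```

With these records, both WSDL copies end up under the single identifier from the slide, regardless of where each copy was found.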
  11. Search Results
  12. Service Overview
  13. seekda Web Service Search Engine
  14. Why crawl for Web APIs?
      - Significant growth of Web APIs
        - > 5,400 Web APIs on ProgrammableWeb (including SOAP and REST APIs) [end of 2009: ca. 1,500 Web APIs]
        - > 6,500 Mashups on ProgrammableWeb (combining Web APIs from one or more sources)
      - SOAP services are only a small part of the overall available public services
  15. Web API – Example (1/3)
  16. Web API – Example (2/3)
  17. Web API – Example (3/3)
      - Problem:
        - Web APIs are described by regular HTML pages
        - No standardized structure that helps with the identification
  18. Web API Identification
      - Solution: Crawl for Web APIs
      - Approach 1: Manual Feature Identification Approach
        - Taking into account HTML structure (e.g., title, mark-up), syntactical properties of used language (e.g., camel-cased words), and link properties of pages (ratio external links / internal links)
      - Approach 2: Automatic Classification Approach
        - Text Classification, supervised learning (Support Vector Machine model)
        - Training set: APIs from ProgrammableWeb
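
To make the manual feature approach on slide 18 more concrete, the sketch below extracts one example of each feature type the slide names: a title feature, a count of camel-cased words, and the ratio of external to internal links. The concrete features, the regular expression and the names `PageFeatures` and `extract_features` are assumptions for illustration; the talk does not say which exact features or thresholds were used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Words such as getUserInfo: a lowercase start followed by capitalized parts.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class PageFeatures(HTMLParser):
    """Collects the title, visible text and link counts of an HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.in_title = False
        self.title = ""
        self.text = []
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urlparse(urljoin(self.page_url, href)).netloc
                if target == self.host:
                    self.internal += 1
                else:
                    self.external += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.text.append(data)


def extract_features(page_url, html):
    parser = PageFeatures(page_url)
    parser.feed(html)
    words = " ".join(parser.text)
    return {
        "title_mentions_api": "api" in parser.title.lower(),
        "camel_case_words": len(CAMEL_CASE.findall(words)),
        "external_to_internal_ratio": parser.external / max(parser.internal, 1),
    }
```

A feature dictionary like this could feed either hand-written rules (Approach 1) or, after vectorization, a supervised classifier (Approach 2).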
  19. Unique Service Objects – Web APIs
      - Create unique identifiers:
        - Again using the provider name (from the Web API homepage)
        - We do not know the service name → hash value of URL instead
        - http://seekda.com/providers/<providerName>/<hashValueOfURL>
      - But: human confirmation is still needed to be sure
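
The identifier scheme on slide 19 might be implemented along these lines. The slide does not say which hash function is used or how the provider name is normalized, so SHA-1 and the stripping of a leading `www.` are assumptions, and `web_api_identifier` is a hypothetical name.

```python
import hashlib
from urllib.parse import urlparse


def web_api_identifier(api_homepage_url):
    """Build http://seekda.com/providers/<providerName>/<hashValueOfURL>."""
    host = urlparse(api_homepage_url).hostname or ""
    # Assumption: the provider name is the host without a leading "www.".
    provider = host[4:] if host.startswith("www.") else host
    # Assumption: SHA-1 of the UTF-8 encoded URL serves as the hash value.
    url_hash = hashlib.sha1(api_homepage_url.encode("utf-8")).hexdigest()
    return f"http://seekda.com/providers/{provider}/{url_hash}"
```

Because the hash depends on the exact URL string, two URLs that differ only in a trailing slash or query parameter would get different identifiers unless the URL is normalized first.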
  20. New Search Engine Prototype
  21. Prototype – User Contributions
      - Web API – yes/no: confirmation from human needed!
      - Other annotations that help improve the search for Web Services
        - Categories
        - Tags
        - Natural Language descriptions
        - Cost: Free or paid service
  22. Problem - User Contribution
      - Problem:
        - Users/developers don’t contribute enough
        - Hard to motivate them to provide annotations
        - Community recognition or peer respect not enough
      - Solution: crowdsourcing the annotations, pay people to provide annotations
        - Use Amazon Mechanical Turk
        - Bootstrap annotations quickly and cheaply
  23. Service Annotation Wizard (1/4)
  24. Service Annotation Wizard (2/4)
  25. Service Annotation Wizard (3/4)
  26. Service Annotation Wizard (4/4)
  27. Amazon Mechanical Turk – Iteration 1
      - Number of Submissions: 70
      - Reward per task: $0.10
      - Restrictions: none
      - Annotation Wizard
        - Web API Yes/No
        - Assign a category
        - Assign tags
        - Provide a natural language description
        - Determine whether page is documentation, pricing or listing
        - Rate the service
  28. Amazon Mechanical Turk – Iteration 1
      - Results
        - 21 APIs correctly identified as APIs
        - 28 Web documents (non-APIs) correctly identified as non-APIs
        - 49/70 correctly identified (70% accuracy)
        - Average task completion time: 2:20 min
      - But, only:
        - 4 well done & complete annotations
        - 8 acceptable annotations (incomplete)
  29. Amazon Mechanical Turk – Iterations 2 & 3

                               Iteration 2   Iteration 3
      Number of Submissions    100           150
      Reward per task          $0.20         $0.20
      Restrictions             yes           yes

      - Annotation Wizard
        - Removed page type identification & service rating
      - For a task to be accepted:
        - At least one category must be assigned
        - At least 2 tags must be provided
        - A meaningful description must be provided
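
The acceptance rules on slide 29 lend themselves to a small validation function, sketched below. The submission format (a dictionary with `categories`, `tags` and `description`) is hypothetical, and since the slide does not define what makes a description "meaningful", a minimum word count stands in for that judgment here as a loudly labeled assumption.

```python
def is_acceptable(submission, min_description_words=5):
    """Apply the slide's acceptance rules to one annotation submission.

    Assumption: "meaningful" is approximated by a minimum word count.
    """
    categories = submission.get("categories", [])
    tags = submission.get("tags", [])
    description = submission.get("description", "").strip()
    return (
        len(categories) >= 1                                   # at least one category
        and len(tags) >= 2                                     # at least 2 tags
        and len(description.split()) >= min_description_words  # stand-in for "meaningful"
    )
```

In practice a word count is easy to game, so a human spot check of descriptions would likely still be needed.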
  30. Amazon Mechanical Turk – Iteration 2 & 3
      - Results Iteration 2 & 3:
        - Ca. 80% of documents correctly identified
        - Very satisfying annotations
        - Average completion time: 2:36 min
  31. Amazon Mechanical Turk – Survey
      - 48 survey submissions
        - Female 18, Male 30
        - Most popular origins: India (27) and USA (9)
        - Popular age groups:
          - 15-22 (12)
          - 23-30 (18)
          - 31-50 (16)
        - Most of them worked in some IT profession
          - Provided best quality annotations
  32. Amazon Mechanical Turk
      - Recommendations for further improvement:
        - Improve task description, especially ‘what is a Web API’
        - Better examples (e.g., hinting what makes a false page false)
        - Allow assignment of multiple categories
        - Restrict to workers in IT professions?
      - Conclusion:
        - Very positive results → good way to get quality annotations
        - Results will help provide better search experience to users
        - Results can be used as positive set for automatic classification
  33. Service Ontologies (1/2)
  34. Service Ontologies (2/2): http://www.service-finder.eu/ontologies/ServiceCategories
  35. Questions?