Working with WebSPHINX Web Crawler

How to work on WebSPHINX Web Crawler's Crawler Workbench.

Published in: Technology

  1. 1. Working With WebSPHINX Web Crawler
  2. 2. Step 1: Run “java -jar websphinx.jar” in the command prompt
  3. 3. This opens the WebSPHINX GUI
  4. 4. Step 2: Specify the starting URL (here, http://www.google.com)
  5. 5. Step 3: Specify the action to be taken. There are five options:
       1) Save: saves the crawled pages to a specified directory.
       2) Concatenate: concatenates the results into a single page; mainly used for printing the search results.
       3) Extract: extracts a specific object type; an HTML tag pattern must be supplied for it. For example, <a>(?{logo}<img>)<p>({caption})</a> allows us to extract images.
       4) Highlight: highlights the results in a specific colour (default: blue).
       5) Script: (we could not try this out, as no scripts were available to us).
  6. 6. Step 4: Select the visualization mode: 1) Graph, 2) Outline, or 3) Statistics
  7. 7. Action to be taken: Save
  8. 8. Action: Save
  9. 9. Because the Myresults folder was specified as the save location, the history and index can be seen here
  10. 10. Files associated with the URL can be seen here
  11. 11. Action to be taken: Concatenate
  12. 12. Action: Concatenate
  13. 14. Action to be taken: Extract
  14. 15. Action: Extract
  15. 17. Action to be taken: Highlight
  16. 18. Action: Highlight. Highlight colour: blue (default)
  17. 20. Crawling visualization: Graph mode
  18. 21. Starting URL or Root of the tree
  19. 22. Root of the tree; child of the root; child of the previous child of the root
  20. 23. In this crawler, the process continues until either all the links have been retrieved or no more hard disk space is available.
  21. 24. The number of links crawled keeps on increasing
  22. 25. We can even see the URL associated with the page
  23. 26. Crawling visualization: Outline mode
  24. 27. The outline mode displays the flow in the crawling process, i.e. it shows how the process is proceeding by displaying the nesting of URLs
  25. 28. We can clearly see the increase in the number of pages crawled
  26. 29. This screenshot displays the increase in the number of home pages as well
  27. 30. Crawling visualization: Statistics mode
  28. 33. Now showing the process in the command prompt
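
The graph-mode slides describe the crawl as a tree: a root, the pages it links to, and the pages those link to, growing until the links run out or a resource limit is hit. That idea can be sketched in a few lines of Java as a breadth-first traversal. This is only an illustration, not WebSPHINX's own code or API: the class CrawlSketch, the method crawl, the maxPages cap (standing in for the disk-space limit) and the example.com link graph are all hypothetical, and nothing is fetched from the network.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class CrawlSketch {
    // Breadth-first traversal over a hypothetical in-memory link graph.
    // Returns every visited URL mapped to its depth below the root,
    // in the order the URLs were discovered.
    static Map<String, Integer> crawl(Map<String, List<String>> links,
                                      String root, int maxPages) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        if (maxPages <= 0) {
            return depth;
        }
        depth.put(root, 0);
        frontier.add(root);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            for (String child : links.getOrDefault(url, List.of())) {
                // Stop once the page cap is reached (assumed stand-in
                // for "no more hard disk space").
                if (depth.size() >= maxPages) {
                    return depth;
                }
                // Skip pages that were already discovered.
                if (!depth.containsKey(child)) {
                    depth.put(child, depth.get(url) + 1);
                    frontier.add(child);
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: the root links to /a and /b, /a links to /c.
        Map<String, List<String>> links = Map.of(
            "http://example.com/",
            List.of("http://example.com/a", "http://example.com/b"),
            "http://example.com/a",
            List.of("http://example.com/c", "http://example.com/"));
        // Print an indented outline, one level of indentation per depth.
        crawl(links, "http://example.com/", 100)
            .forEach((url, d) -> System.out.println("  ".repeat(d) + url));
    }
}
```

Printing each URL indented by its depth gives a nesting similar in spirit to the outline mode, while the size of the returned map corresponds to the growing page count shown in the statistics mode.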
