Working with WebSPHINX Web Crawler

How to work on WebSPHINX Web Crawler's Crawler Workbench.

Published in: Technology

Transcript

  • 1. Working With WebSPHINX Web Crawler
  • 2. Step 1: Run “java -jar websphinx.jar” in the command prompt
  • 3. This opens the WebSPHINX GUI
  • 4. Step 2: Specify the starting URL (http://www.google.com in this example)
  • 5. Step 3: Specify the action to be taken. There are five options:
      1) Save: saves pages to a specified directory.
      2) Concatenate: concatenates the results into a single page, mainly used for printing the search results.
      3) Extract: extracts a specific object type, given an HTML pattern for it. E.g. <a>(?{logo}<img>)<p>({caption})</a> allows us to extract images.
      4) Highlight: highlights results in a specific colour (default: blue).
      5) Script: we could not try this out, as no scripts were available to us.
  • 6. Step 4: Select the visualization mode: 1) Graph 2) Outline 3) Statistics
  • 7. Action to be taken: Save
  • 8. Action: Save
  • 9. As the Myresults folder was specified as the save location, history and index can be seen here
  • 10. Files associated with the URL can be seen here
  • 11. Action to be taken: Concatenate
  • 12. Action: Concatenate
  • 13.  
  • 14. Action to be taken: Extract
  • 15. Action: Extract
  • 16.  
  • 17. Action to be taken: Highlight
  • 18. Action: Highlight. Highlight colour: blue (default)
  • 19.  
  • 20. Crawling visualization: Graph mode
  • 21. Starting URL or Root of the tree
  • 22. Root of the tree; son of the root; son of the previous son of the root
  • 23. In this crawler, this process continues until either all the links have been retrieved or no more hard disk space is available.
  • 24. The number of links crawled keeps on increasing
  • 25. We can even see the URL associated with the page
  • 26. Crawling visualization: Outline mode
  • 27. The outline mode displays the flow in the crawling process, i.e. it shows how the process is proceeding by displaying the nesting of URLs
  • 28. We can clearly see the increase in the number of pages crawled
  • 29. This screenshot displays the increase in the number of home pages as well
  • 30. Crawling visualization: Statistics mode
  • 31.  
  • 32.  
  • 33. The same process, now shown in the command prompt
  • 34.  
  • 35.  
  • 36.
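
The crawl process that the Graph, Outline and Statistics modes visualize can be sketched in a few lines. The Python below is only an illustration and is not WebSPHINX code: it assumes a breadth-first visiting order (the slides do not state the order), it walks a small hypothetical in-memory link graph (LINKS) instead of fetching real pages, and the function names crawl, outline and statistics are made up for this example.

```python
from collections import deque

# Hypothetical link graph standing in for real web pages (an assumption for
# illustration; a real crawler would download each page and parse its links).
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/1", "http://example.com/"],
    "http://example.com/b": ["http://example.com/a/1"],
    "http://example.com/a/1": [],
}


def crawl(root, links):
    """Visit pages breadth-first from root until no unvisited links remain.

    Returns a list of (url, depth, parent) tuples in visit order.
    """
    visited = {root}
    queue = deque([(root, 0, None)])
    order = []
    while queue:
        url, depth, parent = queue.popleft()
        order.append((url, depth, parent))
        for target in links.get(url, []):
            if target not in visited:  # each page is crawled only once
                visited.add(target)
                queue.append((target, depth + 1, url))
    return order


def outline(order):
    """Render the crawl as nested lines, similar in spirit to Outline mode."""
    children = {}
    for url, _, parent in order:
        children.setdefault(parent, []).append(url)
    lines = []

    def walk(url, depth):
        lines.append("  " * depth + url)
        for child in children.get(url, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


def statistics(order, links):
    """Summarize the crawl, similar in spirit to Statistics mode."""
    return {
        "pages": len(order),
        "links": sum(len(links.get(url, [])) for url, _, _ in order),
        "max_depth": max(depth for _, depth, _ in order),
    }


if __name__ == "__main__":
    result = crawl("http://example.com/", LINKS)
    print("\n".join(outline(result)))
    print(statistics(result, LINKS))
```

The root of the tree is the starting URL, each page discovered from it becomes a son of the page that first linked to it, and the loop ends once every reachable link has been retrieved. A disk-space limit such as the one mentioned on slide 23 is not modelled here.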