Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data.

Published in: Technology

  1. JSOUP
  2. Overview
     ● What is jsoup
     ● Parsing from a URL
     ● Parsing from a file
     ● Modifying data
     ● Preventing cross-site scripting
  3. JSOUP
     jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data:
     ● scrape and parse HTML from a URL, file, or string
     ● find and extract data, using DOM traversal or CSS selectors
     ● manipulate the HTML elements, attributes, and text
     ● clean user-submitted content against a safe white-list, to prevent XSS attacks
     ● output tidy HTML
  4. Parse a document from a URL
     The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs while fetching the URL, it throws an IOException, which you should handle appropriately.

         Document document = Jsoup.connect("").get();
         String title = document.title();
  5. Parse a document from a URL (continued)
     The Connection interface is designed for method chaining to build specific requests:

         Document doc = Jsoup.connect("")
                 .userAgent("Mozilla")
                 .cookie("auth", "token")
                 .timeout(3000)
                 .post();
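
The request chain and the IOException handling described on slides 4 and 5 can be combined as in the minimal sketch below. The class name `FetchTitle`, the helper `fetchTitle`, and the `https://example.com/` URL are placeholders chosen for illustration, not part of the original slides; the jsoup library is assumed to be on the classpath.

```java
import java.io.IOException;
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchTitle {

    // Fetches a page and returns its <title> text, or an empty Optional
    // if the request fails (DNS failure, timeout, HTTP error status, ...).
    static Optional<String> fetchTitle(String url) {
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla")   // same chained options as in the slide
                    .timeout(3000)          // milliseconds
                    .get();                 // performs the GET and parses the response
            return Optional.of(doc.title());
        } catch (IOException e) {
            // Network and HTTP failures surface as IOException (or a subclass).
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; replace with the page you want to scrape.
        System.out.println(fetchTitle("https://example.com/").orElse("(fetch failed)"));
    }
}
```

Returning an `Optional` is only one way to handle the failure; rethrowing or logging may suit a real application better.
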
  6. Parse a document from a string
     You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make sure it's well formed, or to modify it. The String may have come from user input, a file, or from the web.

         String html = "<html><head><title>First parse</title></head>"
                 + "<body><p>Parsed HTML into a doc.</p></body></html>";
         Document doc = Jsoup.parse(html);
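
As a small runnable illustration of string parsing, the sketch below parses the same HTML as the slide and reads two values back out. The class and method names (`ParseString`, `titleOf`, `firstParagraph`) are hypothetical helpers introduced here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseString {

    static final String HTML =
            "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";

    // Parses the string and returns the text of the <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Parses the string and returns the text of the first <p> element.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first().text();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(HTML));        // First parse
        System.out.println(firstParagraph(HTML)); // Parsed HTML into a doc.
    }
}
```
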
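  7. Load a document from a file

         File file = new File("/home/shipra/Downloads/Jsoup.html");
         Document document = Jsoup.parse(file, "UTF-8");
         // getElementById returns a single Element (or null if there is no match)
         Element content = document.getElementById("content");
         // getElementsByTag and getElementsByClass return an Elements collection
         Elements paragraphs = document.getElementsByTag("p");
         Elements greens = document.getElementsByClass("green");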
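
To try the file-based lookups without depending on a particular path, the sketch below writes a small sample document to a temporary file and parses that instead. The sample HTML, the class name `ParseFile`, and the helper `loadSample` are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseFile {

    // Hypothetical sample content standing in for a real HTML file.
    static final String SAMPLE =
            "<div id=\"content\"><p class=\"green\">Hello</p><p>World</p></div>";

    // Writes SAMPLE to a temporary file and parses it with an explicit charset.
    static Document loadSample() {
        try {
            File file = File.createTempFile("jsoup-sample", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), SAMPLE.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(file, "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document document = loadSample();
        Element content = document.getElementById("content");   // single Element, or null
        Elements paragraphs = document.getElementsByTag("p");   // every <p> element
        Elements greens = document.getElementsByClass("green"); // every element with class "green"

        System.out.println(content.tagName());
        System.out.println(paragraphs.size());
        System.out.println(greens.text());
    }
}
```
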
  8. Use DOM methods to navigate a document
     You have an HTML document that you want to extract data from.

         File file = new File("/home/shipra/Downloads/Jsoup.html");
         Document document = Jsoup.parse(file, "UTF-8");
         Elements elements = document.select(".nav-sections li");
         for (Element element : elements) {
             String text = element.select("a").text();
             String attr = element.select("a").attr("href");
         }
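
The loop over `.nav-sections li` can be sketched as a self-contained program that collects each link's text and `href`. The class `SelectLinks`, the helper `navLinks`, and the sample markup are invented for illustration; only `select`, `first`, `text`, and `attr` come from jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectLinks {

    // Hypothetical navigation markup used only for this example.
    static final String SAMPLE =
            "<ul class=\"nav-sections\">"
            + "<li><a href=\"/a\">A</a></li>"
            + "<li><a href=\"/b\">B</a></li>"
            + "<li>no link here</li>"
            + "</ul>";

    // Maps link text to its href for every <li> under .nav-sections that contains a link.
    static Map<String, String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element item : doc.select(".nav-sections li")) {
            Element anchor = item.select("a").first(); // null when the item has no <a>
            if (anchor != null) {
                links.put(anchor.text(), anchor.attr("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(navLinks(SAMPLE));
    }
}
```

Checking for `null` before reading the anchor avoids a NullPointerException on list items that contain no link.
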
  9. Modify Data
     Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value). If you need to modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods. The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div:

         doc.select("div.comments a").attr("rel", "nofollow");
         doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
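
A minimal sketch of the bulk attribute and class setters follows. The sample markup, the class `SetAttributes`, and the helper `decorate` are assumptions for this example; the link outside `div.comments` is there to show that the selector limits which elements are changed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SetAttributes {

    // Hypothetical markup: one link inside div.comments, one outside it.
    static final String SAMPLE =
            "<div class=\"masthead\">Site</div>"
            + "<div class=\"comments\"><a href=\"/x\">x</a></div>"
            + "<a id=\"out\" href=\"/y\">y</a>";

    // Applies the two modifications from the slide and returns the changed document.
    static Document decorate(String html) {
        Document doc = Jsoup.parse(html);
        // Elements.attr sets the attribute on every matched element.
        doc.select("div.comments a").attr("rel", "nofollow");
        // The setters return the collection, so calls can be chained.
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(decorate(SAMPLE).body().html());
    }
}
```
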
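  10. Setting the HTML content of an element

          Element div = doc.select("div").first();
          div.html("<p>paragraph</p>");   // replace the inner HTML
          div.prepend("<p>First</p>");    // insert before the existing children
          div.append("<p>Last</p>");      // insert after the existing children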
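
The sketch below runs the three calls from this slide and, for contrast, the `Element.text(String)` setter, which stores its argument as plain text instead of parsing it as markup. The class `SetContent` and both helper methods are hypothetical names introduced for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SetContent {

    // Replaces the first div's inner HTML, then adds a paragraph before and after it.
    static Element rebuild(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div").first();
        div.html("<p>paragraph</p>");
        div.prepend("<p>First</p>");
        div.append("<p>Last</p>");
        return div;
    }

    // Sets the first div's content as text; any markup in the argument is not parsed.
    static Element setText(String html, String text) {
        Element div = Jsoup.parse(html).select("div").first();
        div.text(text);
        return div;
    }

    public static void main(String[] args) {
        System.out.println(rebuild("<div>old</div>").outerHtml());
        System.out.println(setText("<div></div>", "<b>bold?</b>").outerHtml());
    }
}
```

Use `html(...)` when the string is trusted markup and `text(...)` when it should be displayed literally.
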
  11. Sanitize untrusted HTML (to prevent XSS)
      A Whitelist defines which elements and attributes are allowed through cleaning; everything else is discarded. (jsoup 1.14.1 and later name this class Safelist.)

          String unsafe = "<p><a href='' onclick='stealCookies()'>Link</a></p>";
          String safe = Jsoup.clean(unsafe, Whitelist.basic());
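
A runnable sketch of the cleaning step is shown below. It assumes jsoup 1.14.1 or later, where the class is called `Safelist`; on older versions `Whitelist.basic()` plays the same role. The `http://example.com/` link, the class `CleanHtml`, and the helper `sanitize` are placeholders for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanHtml {

    // Keeps only the simple text-formatting tags and attributes allowed by
    // the basic safelist; everything else is removed from the output.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Placeholder URL; the onclick handler is the part that must not survive.
        String unsafe =
                "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(sanitize(unsafe));
        System.out.println(sanitize("<script>alert(1)</script><b>ok</b>"));
    }
}
```

The basic safelist drops the `onclick` attribute and `script` elements, and adds `rel="nofollow"` to links it keeps.
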
  12. Tidy HTML
      The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:
      ● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
      ● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
      ● reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
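
Two of the behaviours listed on this slide, closing unclosed tags and building the html/head/body structure, can be checked with the short sketch below. The class `TidyParse` and its helper methods are hypothetical; the implicit-table case is left out because the result may differ between jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TidyParse {

    // Counts the <p> elements the parser produces for the given input.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // True when the parsed document has exactly one head and one body under html.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("html > head").size() == 1
                && doc.select("html > body").size() == 1;
    }

    // Returns the text of the title element found inside head.
    static String titleInHead(String html) {
        return Jsoup.parse(html).select("head > title").text();
    }

    public static void main(String[] args) {
        System.out.println(paragraphCount("<p>Lorem <p>Ipsum"));  // 2
        System.out.println(hasHeadAndBody("<p>Just a fragment")); // true
        System.out.println(titleInHead("<title>T</title><p>x"));  // T
        // Print the normalised markup to see the closed tags.
        System.out.println(Jsoup.parse("<p>Lorem <p>Ipsum").body().html());
    }
}
```
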
  13. Demo Reference