SlideShare a Scribd company logo
1 of 14
Download to read offline
JSOUP
Overview
What is Jsoup
Parsing with Url
Parsing with File
Modify Data
Prevent cross site scripting
JSOUP
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for
extracting and manipulating data,
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
● clean user-submitted content against a safe white-list, to prevent XSS attacks
● output tidy HTML
Parse a document from a url
The connect(String url) method creates a new Connection, and get()fetches and parses a HTML file. If
an error occurs whilst fetching the URL, it will throw an IOException, which you should handle
appropriately.
Document document = Jsoup.connect("https://grails.org/").get()
String title = document.title()
.
Continue..
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Parse a document from a string
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make
sure it's well formed, or to modify it. The String may have come from user input, a file, or from the
web.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Load a document from a file
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
String content = document.getElementById(“content”)
String tag = document.getElementByTag(“p”)
String class = document.getElementByClass(“green”)
Use DOM methods to navigate a document
You have a HTML document that you want to extract data from.
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Elements elements = document.select(".nav-sections li")
elements.each { element ->
String text = element.select("a").text()
String attr = element.select("a").attr("href")
}
Modify Data
Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key,
String value).
If you need to modify the class attribute of an element, use the Element.addClass(String className)
and Element.removeClass(String className) methods.
The Elements collection has bulk attribue and class methods. For example, to add a rel="nofollow"
attribute to every a element inside a div:
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
Setting the text content of an element
Element div = document.select("div").first();
div.html("<p>paragraph</p>");
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
Sanitize untrusted HTML (to prevent XSS)
Whitelist allows what are the features that are passed to cleaning and others are discarded.
String unsafe ="<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
String safe = Jsoup.clean(unsafe, Whitelist.basic());
Tidy HTML
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of
whether the HTML is well-formed or not. It handles:
● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
● reliably creating the document structure (html containing a head and body, and only
appropriate elements within the head)
Demo Reference
https://github.com/NexThoughts/JSOUP.git
JSOUP Parsing and Modifying HTML Documents

More Related Content

What's hot (20)

Session 20 - Collections - Maps
Session 20 - Collections - MapsSession 20 - Collections - Maps
Session 20 - Collections - Maps
 
09.Local Database Files and Storage on WP
09.Local Database Files and Storage on WP09.Local Database Files and Storage on WP
09.Local Database Files and Storage on WP
 
Xml processors
Xml processorsXml processors
Xml processors
 
Data handling in python
Data handling in pythonData handling in python
Data handling in python
 
Introductionto xslt
Introductionto xsltIntroductionto xslt
Introductionto xslt
 
Mongo db nosql (1)
Mongo db nosql (1)Mongo db nosql (1)
Mongo db nosql (1)
 
XML - SAX
XML - SAXXML - SAX
XML - SAX
 
DOM-XML
DOM-XMLDOM-XML
DOM-XML
 
XML
XMLXML
XML
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Session 5
Session 5Session 5
Session 5
 
Users as Data
Users as DataUsers as Data
Users as Data
 
Cis166 Final Review C#
Cis166 Final Review C#Cis166 Final Review C#
Cis166 Final Review C#
 
Xml processing in scala
Xml processing in scalaXml processing in scala
Xml processing in scala
 
XSL - XML STYLE SHEET
XSL - XML STYLE SHEETXSL - XML STYLE SHEET
XSL - XML STYLE SHEET
 
OData and SharePoint
OData and SharePointOData and SharePoint
OData and SharePoint
 
Chapter iii(working with data)
Chapter iii(working with data)Chapter iii(working with data)
Chapter iii(working with data)
 
Json – java script object notation
Json – java script object notationJson – java script object notation
Json – java script object notation
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 

Viewers also liked (17)

Jmh
JmhJmh
Jmh
 
RESTEasy
RESTEasyRESTEasy
RESTEasy
 
JFree chart
JFree chartJFree chart
JFree chart
 
Spring Web Flow
Spring Web FlowSpring Web Flow
Spring Web Flow
 
Introduction to es6
Introduction to es6Introduction to es6
Introduction to es6
 
Hamcrest
HamcrestHamcrest
Hamcrest
 
Introduction to gradle
Introduction to gradleIntroduction to gradle
Introduction to gradle
 
Grails with swagger
Grails with swaggerGrails with swagger
Grails with swagger
 
Actors model in gpars
Actors model in gparsActors model in gpars
Actors model in gpars
 
Unit test-using-spock in Grails
Unit test-using-spock in GrailsUnit test-using-spock in Grails
Unit test-using-spock in Grails
 
Reactive java - Reactive Programming + RxJava
Reactive java - Reactive Programming + RxJavaReactive java - Reactive Programming + RxJava
Reactive java - Reactive Programming + RxJava
 
Cosmos DB Service
Cosmos DB ServiceCosmos DB Service
Cosmos DB Service
 
Apache tika
Apache tikaApache tika
Apache tika
 
Progressive Web-App (PWA)
Progressive Web-App (PWA)Progressive Web-App (PWA)
Progressive Web-App (PWA)
 
Java 8 features
Java 8 featuresJava 8 features
Java 8 features
 
Introduction to thymeleaf
Introduction to thymeleafIntroduction to thymeleaf
Introduction to thymeleaf
 
Vertx
VertxVertx
Vertx
 

Similar to JSOUP Parsing and Modifying HTML Documents

Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyInternet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyAyes Chinmay
 
HTML Foundations, pt 2
HTML Foundations, pt 2HTML Foundations, pt 2
HTML Foundations, pt 2Shawn Calvert
 
Xsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlXsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlAMIT VIRAMGAMI
 
Apache Utilities At Work V5
Apache Utilities At Work   V5Apache Utilities At Work   V5
Apache Utilities At Work V5Tom Marrs
 
Document Object Model (DOM)
Document Object Model (DOM)Document Object Model (DOM)
Document Object Model (DOM)GOPAL BASAK
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc librarySudip Sengupta
 
jQuery Rescue Adventure
jQuery Rescue AdventurejQuery Rescue Adventure
jQuery Rescue AdventureAllegient
 
PostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersPostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersAmanda Gilmore
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application FrameworkSimon Willison
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQueryGunjan Kumar
 

Similar to JSOUP Parsing and Modifying HTML Documents (20)

Jsoup tutorial
Jsoup tutorialJsoup tutorial
Jsoup tutorial
 
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web TechnologyInternet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
Internet and Web Technology (CLASS-7) [XML and AJAX] | NIC/NIELIT Web Technology
 
HTML Foundations, pt 2
HTML Foundations, pt 2HTML Foundations, pt 2
HTML Foundations, pt 2
 
Chapter 18
Chapter 18Chapter 18
Chapter 18
 
Xsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtmlXsd restrictions, xsl elements, dhtml
Xsd restrictions, xsl elements, dhtml
 
Selenium-Locators
Selenium-LocatorsSelenium-Locators
Selenium-Locators
 
Xml 2
Xml  2 Xml  2
Xml 2
 
2310 b 12
2310 b 122310 b 12
2310 b 12
 
Understanding XML DOM
Understanding XML DOMUnderstanding XML DOM
Understanding XML DOM
 
Apache Utilities At Work V5
Apache Utilities At Work   V5Apache Utilities At Work   V5
Apache Utilities At Work V5
 
Document Object Model (DOM)
Document Object Model (DOM)Document Object Model (DOM)
Document Object Model (DOM)
 
Ajax workshop
Ajax workshopAjax workshop
Ajax workshop
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc library
 
jQuery Rescue Adventure
jQuery Rescue AdventurejQuery Rescue Adventure
jQuery Rescue Adventure
 
Ajax
AjaxAjax
Ajax
 
Dom
Dom Dom
Dom
 
Xml & Java
Xml & JavaXml & Java
Xml & Java
 
PostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL SuperpowersPostgreSQL's Secret NoSQL Superpowers
PostgreSQL's Secret NoSQL Superpowers
 
The Django Web Application Framework
The Django Web Application FrameworkThe Django Web Application Framework
The Django Web Application Framework
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 

More from NexThoughts Technologies (20)

Alexa skill
Alexa skillAlexa skill
Alexa skill
 
GraalVM
GraalVMGraalVM
GraalVM
 
Docker & kubernetes
Docker & kubernetesDocker & kubernetes
Docker & kubernetes
 
Apache commons
Apache commonsApache commons
Apache commons
 
HazelCast
HazelCastHazelCast
HazelCast
 
MySQL Pro
MySQL ProMySQL Pro
MySQL Pro
 
Microservice Architecture using Spring Boot with React & Redux
Microservice Architecture using Spring Boot with React & ReduxMicroservice Architecture using Spring Boot with React & Redux
Microservice Architecture using Spring Boot with React & Redux
 
Swagger
SwaggerSwagger
Swagger
 
Solid Principles
Solid PrinciplesSolid Principles
Solid Principles
 
Arango DB
Arango DBArango DB
Arango DB
 
Jython
JythonJython
Jython
 
Introduction to TypeScript
Introduction to TypeScriptIntroduction to TypeScript
Introduction to TypeScript
 
Smart Contract samples
Smart Contract samplesSmart Contract samples
Smart Contract samples
 
My Doc of geth
My Doc of gethMy Doc of geth
My Doc of geth
 
Geth important commands
Geth important commandsGeth important commands
Geth important commands
 
Ethereum genesis
Ethereum genesisEthereum genesis
Ethereum genesis
 
Ethereum
EthereumEthereum
Ethereum
 
Springboot Microservices
Springboot MicroservicesSpringboot Microservices
Springboot Microservices
 
An Introduction to Redux
An Introduction to ReduxAn Introduction to Redux
An Introduction to Redux
 
Google authentication
Google authenticationGoogle authentication
Google authentication
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

JSOUP Parsing and Modifying HTML Documents

  • 2. Overview What is Jsoup Parsing with Url Parsing with File Modify Data Prevent cross site scripting
  • 3. JSOUP jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, ● scrape and parse HTML from a URL, file, or string ● find and extract data, using DOM traversal or CSS selectors ● manipulate the HTML elements, attributes, and text ● clean user-submitted content against a safe white-list, to prevent XSS attacks ● output tidy HTML
  • 4. Parse a document from a url The connect(String url) method creates a new Connection, and get()fetches and parses a HTML file. If an error occurs whilst fetching the URL, it will throw an IOException, which you should handle appropriately. Document document = Jsoup.connect("https://grails.org/").get() String title = document.title() .
  • 5. Continue.. The Connection interface is designed for method chaining to build specific requests: Document doc = Jsoup.connect("http://example.com") .userAgent("Mozilla") .cookie("auth", "token") .timeout(3000) .post();
  • 6. Parse a document from a string You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make sure it's well formed, or to modify it. The String may have come from user input, a file, or from the web. String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc.</p></body></html>"; Document doc = Jsoup.parse(html);
  • 7. Load a document from a file File file = new File("/home/shipra/Downloads/Jsoup.html") Document document = Jsoup.parse(file, "UTF-8") String content = document.getElementById(“content”) String tag = document.getElementByTag(“p”) String class = document.getElementByClass(“green”)
  • 8. Use DOM methods to navigate a document You have a HTML document that you want to extract data from. File file = new File("/home/shipra/Downloads/Jsoup.html") Document document = Jsoup.parse(file, "UTF-8") Elements elements = document.select(".nav-sections li") elements.each { element -> String text = element.select("a").text() String attr = element.select("a").attr("href") }
  • 9. Modify Data Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key, String value). If you need to modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods. The Elements collection has bulk attribue and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div: doc.select("div.comments a").attr("rel", "nofollow"); doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
  • 10. Setting the text content of an element Element div = document.select("div").first(); div.html("<p>paragraph</p>"); div.prepend("<p>First</p>"); div.append("<p>Last</p>");
  • 11. Sanitize untrusted HTML (to prevent XSS) Whitelist allows what are the features that are passed to cleaning and others are discarded. String unsafe ="<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>" String safe = Jsoup.clean(unsafe, Whitelist.basic());
  • 12. Tidy HTML The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles: ● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>) ● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...) ● reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)