Palakorn Nakphong
Founder: Nextzy Technologies Co.,ltd.
[“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer];
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Jsoup
Java HTML Parser
Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data.
Complex DOM Element
Old Web Scraping
How to get data in tag?
Regular expressions are painful
String expr = "<td><span\\s+class=\"flagicon\"[^>]*>"
+ ".*?</span><a href=\""
+ "([^\"]+)" // first piece of data goes up to quote
+ "\"[^>]*>" // end quote, then skip to end of tag
+ "([^<]+)" // name is data up to next tag
+ "</a>.*?</td>"; // end a tag, then skip to the td close tag
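To see what the regex approach involves in practice, here is a minimal sketch that applies the expression above with `java.util.regex`, with the Java string escapes written out. The class name, the helper `firstLink`, and the sample table cell are hypothetical and made up for illustration; `DOTALL` is an added assumption so that `.*?` can span line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    // The expression from the slide, compiled once.
    static final Pattern CELL = Pattern.compile(
        "<td><span\\s+class=\"flagicon\"[^>]*>"
        + ".*?</span><a href=\""
        + "([^\"]+)"      // group 1: the href value
        + "\"[^>]*>"
        + "([^<]+)"       // group 2: the link text
        + "</a>.*?</td>",
        Pattern.DOTALL);

    // Returns {href, name} for the first matching cell, or null if nothing matches.
    static String[] firstLink(String html) {
        Matcher m = CELL.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample cell, not taken from any real page.
        String html = "<td><span class=\"flagicon\"><img src=\"th.png\"></span>"
                    + "<a href=\"/wiki/Thailand\" title=\"Thailand\">Thailand</a></td>";
        String[] link = firstLink(html);
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

Any change in the markup, such as an extra attribute before `class` or a missing `span`, makes the match fail silently, which is the fragility the slide is complaining about.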
New Web Scraping
Using Jsoup
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>
What is the Jsoup Library?
• Jsoup can scrape and parse HTML from a URL, file, or string
• Jsoup can find and extract data, using DOM traversal or CSS selectors
• Jsoup allows you to manipulate the HTML elements, attributes, and text
• Jsoup can clean user-submitted content against a safe whitelist to
prevent XSS attacks
• Jsoup can also output tidy HTML
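A few of these capabilities can be sketched in one short program: parsing from a string, finding an element, changing its text and attributes, and writing the HTML back out. The class name, the helper `relabelFirstLink`, and the input markup are hypothetical; only the jsoup calls themselves (`Jsoup.parse`, `select`, `text`, `attr`, `html`) are library API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBasics {
    // Parses an HTML string, rewrites the first link, and returns the body markup.
    static String relabelFirstLink(String html, String newText) {
        Document doc = Jsoup.parse(html);        // parse from a string
        Element link = doc.select("a").first();  // find with a CSS selector
        if (link != null) {
            link.text(newText);                  // manipulate the text
            link.attr("rel", "nofollow");        // manipulate an attribute
        }
        return doc.body().html();                // output normalized HTML
    }

    public static void main(String[] args) {
        // The unquoted attribute and unclosed <p> are deliberate "real-world" sloppiness.
        System.out.println(relabelFirstLink("<p><a href=test.html>My link</a>", "Changed"));
    }
}
```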
Example DOM Element
Document doc = Jsoup.connect("http://www.nextzy.com/").get();
String title = doc.title();
<html>
<head>
<title>My title</title>
</head>
<body>
<h1>My header</h1>
<a href="test.html">My link</a>
</body>
</html>
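The same `title()` call can be tried without a network connection by parsing the sample page above from a string. This is a minimal sketch; the class name and the helper `titleOf` are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleExample {
    // The sample page from the slide, as a Java string.
    static final String HTML =
        "<html><head><title>My title</title></head>"
        + "<body><h1>My header</h1><a href=\"test.html\">My link</a></body></html>";

    // Returns the text of the <title> element, or an empty string if there is none.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(HTML)); // prints "My title"
    }
}
```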
File input = new File("/file/nextzy.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
Get Element By …
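One detail worth noting is that `getElementById` returns `null` when no element has that id, so the traversal above would throw on a page without a `content` element. A minimal sketch with that guard, collecting each link's `href` and text into a map, might look like this. The class name, the helper `linksInContent`, and the sample markup are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContentLinks {
    // Maps href -> link text for every <a> inside the element with id="content".
    static Map<String, String> linksInContent(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        if (content == null) {
            return result; // no such element: return an empty map instead of failing
        }
        for (Element link : content.getElementsByTag("a")) {
            result.put(link.attr("href"), link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
                    + "<a href=\"/c\">outside</a>";
        System.out.println(linksInContent(html)); // the /c link is outside #content
    }
}
```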
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.active > a");
Like CSS Selector …
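To make the selector syntax concrete, here is a minimal sketch that runs two of the selectors above against a small hand-written fragment. The class name, both helper methods, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorExample {
    // img[src$=.png]: images whose src attribute ends with ".png".
    static List<String> pngSources(String html) {
        List<String> out = new ArrayList<>();
        for (Element img : Jsoup.parse(html).select("img[src$=.png]")) {
            out.add(img.attr("src"));
        }
        return out;
    }

    // h3.active > a: links that are direct children of an <h3 class="active">.
    static List<String> activeHeadingLinks(String html) {
        List<String> out = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("h3.active > a")) {
            out.add(a.text());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\"><img src=\"b.jpg\">"
                    + "<h3 class=\"active\"><a href=\"/x\">X</a></h3>"
                    + "<h3><a href=\"/y\">Y</a></h3>";
        System.out.println(pngSources(html));
        System.out.println(activeHeadingLinks(html));
    }
}
```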
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
Work with URL …
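The `abs:` prefix resolves a relative URL against the document's base URI. When the HTML comes from a string rather than from `Jsoup.connect`, the base URI has to be supplied to `Jsoup.parse`. The following is a minimal sketch under that assumption; the class name, the helper `hrefs`, and the `example.com` URL are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUrlExample {
    // Returns {raw href, "abs:href" value, absUrl("href") value} for the first link,
    // or null if the document contains no link.
    static String[] hrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        if (link == null) {
            return null;
        }
        return new String[] {
            link.attr("href"),      // value exactly as written in the markup
            link.attr("abs:href"),  // resolved against the base URI
            link.absUrl("href")     // equivalent method form
        };
    }

    public static void main(String[] args) {
        for (String s : hrefs("<a href=\"/docs\">Docs</a>", "http://example.com/")) {
            System.out.println(s);
        }
    }
}
```

If no base URI is available and the link is relative, the absolute form comes back as an empty string, so it is worth checking for that before using the result.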
Come join us as pirates...
https://www.blognone.com/node/64996
Thank You
Nextzy Technologies Co.,ltd. Jsoup
