SlideShare a Scribd company logo
Palakorn Nakphong
Founder: Nextzy Technologies Co.,ltd.
[“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer];
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Jsoup
Java HTML Parser
Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Complex DOM Element
Old Web Scraping
How to get data in tag?
Regular expression is F*uk
String expr = "<td><spans+class="flagicon"[^>]*>"
+ ".*?</span><a href=""
+ "([^"]+)" // first piece of data goes up to quote
+ ""[^>]*>" // end quote, then skip to end of tag
+ "([^<]+)" // name is data up to next tag
+ "</a>.*?</td>"; // end a tag, then skip to the td close tag
New Web Scraping
Using Jsoup
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.1</version>
</dependency>
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
WhatisJSoupLibrary?
• Jsoup can scrape and parse HTML from a URL, file, or string
• Jsoup can find and extract data, using DOM traversal or CSS selectors
• Jsoup allows you to manipulate the HTML elements, attributes, and text
• Jsoup provides clean user-submitted content against a safe white-list, to
prevent XSS attacks
• Jsoup also output tidy HTML
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Example DOM Element
Document doc = Jsoup.connect("http://www.nextzy.com/").get();
String title = doc.title();
<html>
<head>
<title>My title</title>
</head>
<body>
<h1>My header</h1>
<a href="test.html">My link</a>
</body>
</html>
File input = new File("/file/nextzy.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
Get Element By …
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.active > a");
Like CSS Selector …
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/“
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
Work with URL …
fb.com/codingz @Codingz th.linkedin.com/in/palakorn
มาร่วมเป็นโจรสลัดกับเรา...
https://www.blognone.com/node/64996
Thanks You
Nextzy Technologies Co.,ltd. Jsoup

More Related Content

What's hot

Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
Snehil Verma
 
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)dataSUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
Diego Valerio Camarda
 
Command line Data Tools
Command line Data ToolsCommand line Data Tools
Command line Data Tools
Peter Wang
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapy
orangain
 
Nlp Apis
Nlp ApisNlp Apis
Nlp Apis
guest70f525
 
Nlp Apis
Nlp ApisNlp Apis
Nlp Apis
marksoper
 
Using server logs to your advantage
Using server logs to your advantageUsing server logs to your advantage
Using server logs to your advantage
Alexandra Johnson
 
Teets, "NISO Next Generation Discovery"
Teets, "NISO Next Generation Discovery"Teets, "NISO Next Generation Discovery"
Teets, "NISO Next Generation Discovery"
National Information Standards Organization (NISO)
 
Web Scrapping with Python
Web Scrapping with PythonWeb Scrapping with Python
Web Scrapping with Python
Miguel Miranda de Mattos
 
Intro to XML in libraries
Intro to XML in librariesIntro to XML in libraries
Intro to XML in libraries
Kyle Banerjee
 
A quick review of Python and Graph Databases
A quick review of Python and Graph DatabasesA quick review of Python and Graph Databases
A quick review of Python and Graph Databases
Nicholas Crouch
 
Web Browsers And Other Mistakes
Web Browsers And Other MistakesWeb Browsers And Other Mistakes
Web Browsers And Other Mistakes
kuza55
 
Keynote session - LOD2014 W3C event
Keynote session - LOD2014 W3C eventKeynote session - LOD2014 W3C event
Keynote session - LOD2014 W3C event
Diego Valerio Camarda
 
Introduction to the DOM
Introduction to the DOMIntroduction to the DOM
Introduction to the DOM
tharaa abu ashour
 
INTRODUCTION TO DOM AND DOM TREE
INTRODUCTION TO DOM AND DOM TREEINTRODUCTION TO DOM AND DOM TREE
INTRODUCTION TO DOM AND DOM TREE
systematiclab
 
Test Slide
Test SlideTest Slide
Test Slide
NDeannaPenny
 
CouchDB in The Room
CouchDB in The RoomCouchDB in The Room
CouchDB in The Room
Makoto Ohnami
 
Perl behind the Wall
Perl behind the Wall Perl behind the Wall
Perl behind the Wall
Andrew Shitov
 
fluent-plugin-beats at Elasticsearch meetup #14
fluent-plugin-beats at Elasticsearch meetup #14fluent-plugin-beats at Elasticsearch meetup #14
fluent-plugin-beats at Elasticsearch meetup #14
N Masahiro
 
Ruby on Rails and the Semantic Web
Ruby on Rails and the Semantic WebRuby on Rails and the Semantic Web
Ruby on Rails and the Semantic Web
Nathalie Steinmetz
 

What's hot (20)

Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)dataSUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
SUMMER SCHOOL LEX 2014 - RDF + SPARQL querying the web of (lex)data
 
Command line Data Tools
Command line Data ToolsCommand line Data Tools
Command line Data Tools
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapy
 
Nlp Apis
Nlp ApisNlp Apis
Nlp Apis
 
Nlp Apis
Nlp ApisNlp Apis
Nlp Apis
 
Using server logs to your advantage
Using server logs to your advantageUsing server logs to your advantage
Using server logs to your advantage
 
Teets, "NISO Next Generation Discovery"
Teets, "NISO Next Generation Discovery"Teets, "NISO Next Generation Discovery"
Teets, "NISO Next Generation Discovery"
 
Web Scrapping with Python
Web Scrapping with PythonWeb Scrapping with Python
Web Scrapping with Python
 
Intro to XML in libraries
Intro to XML in librariesIntro to XML in libraries
Intro to XML in libraries
 
A quick review of Python and Graph Databases
A quick review of Python and Graph DatabasesA quick review of Python and Graph Databases
A quick review of Python and Graph Databases
 
Web Browsers And Other Mistakes
Web Browsers And Other MistakesWeb Browsers And Other Mistakes
Web Browsers And Other Mistakes
 
Keynote session - LOD2014 W3C event
Keynote session - LOD2014 W3C eventKeynote session - LOD2014 W3C event
Keynote session - LOD2014 W3C event
 
Introduction to the DOM
Introduction to the DOMIntroduction to the DOM
Introduction to the DOM
 
INTRODUCTION TO DOM AND DOM TREE
INTRODUCTION TO DOM AND DOM TREEINTRODUCTION TO DOM AND DOM TREE
INTRODUCTION TO DOM AND DOM TREE
 
Test Slide
Test SlideTest Slide
Test Slide
 
CouchDB in The Room
CouchDB in The RoomCouchDB in The Room
CouchDB in The Room
 
Perl behind the Wall
Perl behind the Wall Perl behind the Wall
Perl behind the Wall
 
fluent-plugin-beats at Elasticsearch meetup #14
fluent-plugin-beats at Elasticsearch meetup #14fluent-plugin-beats at Elasticsearch meetup #14
fluent-plugin-beats at Elasticsearch meetup #14
 
Ruby on Rails and the Semantic Web
Ruby on Rails and the Semantic WebRuby on Rails and the Semantic Web
Ruby on Rails and the Semantic Web
 

Viewers also liked

IT Outsource Meetup
IT Outsource MeetupIT Outsource Meetup
IT Outsource Meetup
Nextzy Technologies Co.,ltd
 
Bangkok university Speaker
Bangkok university SpeakerBangkok university Speaker
Bangkok university Speaker
Nextzy Technologies Co.,ltd
 
Numbers
NumbersNumbers
Numbers
JOEMCGEE
 
Nextzy Office Environment
Nextzy Office EnvironmentNextzy Office Environment
Nextzy Office Environment
Nextzy Technologies Co.,ltd
 
Spring
SpringSpring
ระบบชุมชนออนไลน์ 100%
ระบบชุมชนออนไลน์ 100%ระบบชุมชนออนไลน์ 100%
ระบบชุมชนออนไลน์ 100%Nextzy Technologies Co.,ltd
 
Nextzy Technologies Company profile
Nextzy Technologies Company profileNextzy Technologies Company profile
Nextzy Technologies Company profile
Nextzy Technologies Co.,ltd
 

Viewers also liked (7)

IT Outsource Meetup
IT Outsource MeetupIT Outsource Meetup
IT Outsource Meetup
 
Bangkok university Speaker
Bangkok university SpeakerBangkok university Speaker
Bangkok university Speaker
 
Numbers
NumbersNumbers
Numbers
 
Nextzy Office Environment
Nextzy Office EnvironmentNextzy Office Environment
Nextzy Office Environment
 
Spring
SpringSpring
Spring
 
ระบบชุมชนออนไลน์ 100%
ระบบชุมชนออนไลน์ 100%ระบบชุมชนออนไลน์ 100%
ระบบชุมชนออนไลน์ 100%
 
Nextzy Technologies Company profile
Nextzy Technologies Company profileNextzy Technologies Company profile
Nextzy Technologies Company profile
 

Similar to Nextzy Technologies Co.,ltd. Jsoup

Ruby Isn't Just About Rails
Ruby Isn't Just About RailsRuby Isn't Just About Rails
Ruby Isn't Just About Rails
Adam Wiggins
 
Javazone 2010-lift-framework-public
Javazone 2010-lift-framework-publicJavazone 2010-lift-framework-public
Javazone 2010-lift-framework-public
Timothy Perrett
 
Attacks against Microsoft network web clients
Attacks against Microsoft network web clients Attacks against Microsoft network web clients
Attacks against Microsoft network web clients
Positive Hack Days
 
Native Phone Development 101
Native Phone Development 101Native Phone Development 101
Native Phone Development 101
Sasmito Adibowo
 
Xml
XmlXml
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
jnewmanux
 
JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4
alexsaves
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
BG Java EE Course
 
4 JVM Web Frameworks
4 JVM Web Frameworks4 JVM Web Frameworks
4 JVM Web Frameworks
Joe Kutner
 
Html5 and web technology update
Html5 and web technology updateHtml5 and web technology update
Html5 and web technology update
Doug Domeny
 
Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02
Ramamohan Chokkam
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
granicz
 
Presentation of JSConf.eu
Presentation of JSConf.euPresentation of JSConf.eu
Presentation of JSConf.eu
Fredrik Wendt
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
Guillaume Laforge
 
Javascript Templating
Javascript TemplatingJavascript Templating
Javascript Templating
bcruhl
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
Ajax Experience 2009
 
Adventurous Merb
Adventurous MerbAdventurous Merb
Adventurous Merb
Matt Todd
 
AD102 - Break out of the Box
AD102 - Break out of the BoxAD102 - Break out of the Box
AD102 - Break out of the Box
Karl-Henry Martinsson
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
Andy McKay
 
HTML5
HTML5HTML5

Similar to Nextzy Technologies Co.,ltd. Jsoup (20)

Ruby Isn't Just About Rails
Ruby Isn't Just About RailsRuby Isn't Just About Rails
Ruby Isn't Just About Rails
 
Javazone 2010-lift-framework-public
Javazone 2010-lift-framework-publicJavazone 2010-lift-framework-public
Javazone 2010-lift-framework-public
 
Attacks against Microsoft network web clients
Attacks against Microsoft network web clients Attacks against Microsoft network web clients
Attacks against Microsoft network web clients
 
Native Phone Development 101
Native Phone Development 101Native Phone Development 101
Native Phone Development 101
 
Xml
XmlXml
Xml
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
4 JVM Web Frameworks
4 JVM Web Frameworks4 JVM Web Frameworks
4 JVM Web Frameworks
 
Html5 and web technology update
Html5 and web technology updateHtml5 and web technology update
Html5 and web technology update
 
Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02Jsonsaga 100605143125-phpapp02
Jsonsaga 100605143125-phpapp02
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
 
Presentation of JSConf.eu
Presentation of JSConf.euPresentation of JSConf.eu
Presentation of JSConf.eu
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
 
Javascript Templating
Javascript TemplatingJavascript Templating
Javascript Templating
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
 
Adventurous Merb
Adventurous MerbAdventurous Merb
Adventurous Merb
 
AD102 - Break out of the Box
AD102 - Break out of the BoxAD102 - Break out of the Box
AD102 - Break out of the Box
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
HTML5
HTML5HTML5
HTML5
 

Recently uploaded

みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
FODUU
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Claudio Di Ciccio
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 

Recently uploaded (20)

みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Things to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUUThings to Consider When Choosing a Website Developer for your Website | FODUU
Things to Consider When Choosing a Website Developer for your Website | FODUU
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 

Nextzy Technologies Co.,ltd. Jsoup

  • 1. Palakorn Nakphong Founder: Nextzy Technologies Co.,ltd. [“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer]; fb.com/codingz @Codingz th.linkedin.com/in/palakorn
  • 2. Jsoup Java HTML Parser Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API fb.com/codingz @Codingz th.linkedin.com/in/palakorn
  • 5. How to get data in tag?
  • 6. Regular expression is F*uk String expr = "<td><spans+class="flagicon"[^>]*>" + ".*?</span><a href="" + "([^"]+)" // first piece of data goes up to quote + ""[^>]*>" // end quote, then skip to end of tag + "([^<]+)" // name is data up to next tag + "</a>.*?</td>"; // end a tag, then skip to the td close tag
  • 10. • Jsoup can scrape and parse HTML from a URL, file, or string • Jsoup can find and extract data, using DOM traversal or CSS selectors • Jsoup allows you to manipulate the HTML elements, attributes, and text • Jsoup provides clean user-submitted content against a safe white-list, to prevent XSS attacks • Jsoup also output tidy HTML fb.com/codingz @Codingz th.linkedin.com/in/palakorn
  • 11. Example DOM Element Document doc = Jsoup.connect("http://www.nextzy.com/").get(); String title = doc.title(); <html> <head> <title>My title</title> </head> <body> <h1>My header</h1> <a href="test.html">My link</a> </body> </html>
  • 12. File input = new File("/file/nextzy.html"); Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/"); Element content = doc.getElementById("content"); Elements links = content.getElementsByTag("a"); for (Element link : links) { String linkHref = link.attr("href"); String linkText = link.text(); } Get Element By … fb.com/codingz @Codingz th.linkedin.com/in/palakorn
  • 13. Elements links = doc.select("a[href]"); Elements pngs = doc.select("img[src$=.png]"); Element masthead = doc.select("div.masthead").first(); Elements resultLinks = doc.select("h3.active > a"); Like CSS Selector … fb.com/codingz @Codingz th.linkedin.com/in/palakorn
  • 14. Document doc = Jsoup.connect("http://jsoup.org").get(); Element link = doc.select("a").first(); String relHref = link.attr("href"); // == "/“ String absHref = link.attr("abs:href"); // "http://jsoup.org/" Work with URL … fb.com/codingz @Codingz th.linkedin.com/in/palakorn