Attacking fire with fire
Or how to get an API from any website
I am Danielius Visockas
#givingBackToCommunity
Salut!
Web harvesting
Go to a page
Extract the data
Download a document
Basic diagram of web harvesting
Fundamental metrics
◉ Freshness
◉ Age
Revisiting policy
◉ Constant
◉ Based on freshness
“Edward Coffman et al. proposed that a crawler must minimize the fraction of time pages remain outdated.”
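The two metrics can be sketched per page as follows. This is a minimal illustration, assuming timestamps are plain numbers and that the time of the last change on the live page is known; a real crawler usually has to estimate it. The function names and the `pages` records are hypothetical.

```javascript
// Freshness: 1 if the local copy still matches the live page, else 0.
// The copy is current when the page has not changed since the last crawl.
function freshness(lastCrawledAt, lastModifiedAt) {
  return lastModifiedAt <= lastCrawledAt ? 1 : 0;
}

// Age: 0 while the copy is current, otherwise the time elapsed
// since the live page changed.
function age(now, lastCrawledAt, lastModifiedAt) {
  return lastModifiedAt <= lastCrawledAt ? 0 : now - lastModifiedAt;
}

// One possible "based on freshness" revisit choice (an assumption,
// not the policy from the talk): revisit the page with the largest age.
function stalest(pages, now) {
  return pages.reduce((worst, p) =>
    age(now, p.lastCrawledAt, p.lastModifiedAt) >
    age(now, worst.lastCrawledAt, worst.lastModifiedAt) ? p : worst);
}

// Made-up records for illustration only.
const pages = [
  { url: 'a', lastCrawledAt: 100, lastModifiedAt: 150 },
  { url: 'b', lastCrawledAt: 100, lastModifiedAt: 120 },
];
console.log(stalest(pages, 200).url);
```

A constant policy would ignore these numbers and revisit every page at the same interval.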
Aaah, easy
curl -i https://delfi.lt
No SSL...
curl -i http://delfi.lt
Doesn’t work....
Let’s try mobile
curl -i http://m.delfi.lt
……………….
<script type="text/javascript">(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,'script','//www.google-analytics.com/analytics.js','ga');ga('create','UA-2428893-5','auto');
var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0];
if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/[,;].*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();ga('set','dimension1',au);}
if(m = navigator.userAgent.match(/Delfi\/([0-9.]+)/)){var ua='Other';if(/ip(hone|ad|od)/i.test(navigator.userAgent))ua='iOS App';else if(/android/i.test(navigator.userAgent))ua='Android App';else if(/(windows|msie)/i.test(navigator.userAgent))ua='Windows App';ga('set','dimension2',ua);}else if(/FBAV\//.test(navigator.userAgent))ga('set','dimension2','FBWV');else ga('set','dimension2','Browser');
ga('set','dimension3',''+(window.__dabd && window.__dabd()));ga('send','pageview');
</script>
<script type="text/javascript">var t=window.location.hostname.split('.').reverse();if(window._dct)_dct({s:'delfi/mobile',d:'t.'+t[1]+'.'+t[0]});</script>
<script type="text/javascript">
var __ae=document.getElementsByClassName('delfi-author-name')[0]||document.getElementsByClassName('article-author-name')[0],au='',_sf_async_config = {};
if('undefined' !== typeof __ae){var au=__ae.textContent;au=au.replace(/,.*/g,'');au=au.replace(/^\s+|\s+$/g,'');au=au.toLowerCase();}
_sf_async_config.uid=46335;_sf_async_config.domain='delfi.lt';_sf_async_config.sections='m.delfi';_sf_async_config.authors=au;_sf_async_config.useCanonical=true;
(function(){function loadChartbeat(){window._sf_endpt=(new Date()).getTime();var e=document.createElement('script');e.setAttribute('language', 'javascript');e.setAttribute('type', 'text/javascript');e.setAttribute('src', '//static.chartbeat.com/js/chartbeat.js');document.body.appendChild(e);}
var oldonload=window.onload; window.onload=(typeof window.onload != 'function') ? loadChartbeat : function() { oldonload(); loadChartbeat(); };
})();
</script>
<script
This looks familiar
Let’s use regex and it should be fine
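A rough sketch of what the regex route looks like, and why it only holds up on simple markup. The HTML snippets and the `extractHeadline` name are made up for illustration; nothing here comes from the real delfi.lt page.

```javascript
// Naive regex extraction: return the inner text of the first <h1>,
// or null when no <h1> is found.
function extractHeadline(html) {
  const match = /<h1[^>]*>([\s\S]*?)<\/h1>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Hypothetical inputs: the same headline, with and without nested markup.
const simple = '<html><body><h1 class="title">Train delays in Kaunas</h1></body></html>';
const nested = '<html><body><h1><a href="/a">Train delays</a> in Kaunas</h1></body></html>';

console.log(extractHeadline(simple)); // the plain headline text
console.log(extractHeadline(nested)); // the <a> markup leaks into the result
```

The pattern has no idea about nesting, so every markup variation needs another special case. An HTML parser avoids that.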
Overengineering
Basic techniques
Pick the right tool for the job
Two ways to harvest
◉ One-time: your computer is on
◉ Automated: can be done on a server
Copy and paste
Client-side scripting
Extensions and bookmarks!
Online scrapers
The fun part
Automated scraping
“Don’t forget to watch the network tab”
Fetching of websites
Extraction of data
Cheerio
But then it all changed
When fire nation attacked
I found a girl in Kaunas...
7 seconds
Traukiniobilietas.lt response time
That’s five seconds more than it takes to say “seven seconds”
Screenshot
Traukiniobilietas.lt didn’t load...
So I decided to learn React
And built an app that helps you to find trips
Want big impact?
Use big image.
How do I get the Data?!
Headless browsers
Bring together the best of both worlds
I used Casper.js
◉ Runs on PhantomJS
◉ Resource intensive
◉ Can replicate everything
◉ Takes a bit longer
◉ DoS’ed traukiniobilietas…
◉ Works
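Since a resource-hungry scraper can end up overloading the target site, one common precaution is to run page visits one at a time with a pause in between. The sketch below is an assumption about how that could be done, not what the talk's scraper did; `runPolitely` and the stand-in tasks are hypothetical, and a real task would drive the headless browser.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run async tasks strictly one after another, waiting delayMs
// between consecutive tasks so the target never sees a burst.
async function runPolitely(tasks, delayMs) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(delayMs);
    results.push(await tasks[i]());
  }
  return results;
}

// Stand-in tasks; each would normally load and scrape one page.
runPolitely([async () => 'page 1', async () => 'page 2'], 50)
  .then((results) => console.log(results));
```

A fixed delay is the simplest choice. It trades total run time for a lighter load on the site.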
“So basically, you have to pick the right tool for the job”
#noFreeLunchTheory
Legal stuff
Security
CAPTCHAS and friends...
Interesting ideas
◉ Visual scraping using Machine Learning
◉ Macros + Casper.js (github.com/dvisockas/scrape)
Please ask questions!
Thank you!
And if someone from TRAFI could help me with the traveling salesman problem...

Vilnius.js