SlideShare a Scribd company logo
scraping,




                               http://www.flickr.com/photos/juan23/82888194/
 scripting and
 hacking your way to
 API-less data
[AKA: if you don’t have data
feeds, we’ll get it anyway]
overview

•   “getting data out”
•   non-exhaustive (and rapid!)
•   slightly random
•   live examples (hopefully)
•   mainly non-technical(ish)
•   mainly non-illegal. I think.
anything goes

•   have no fear!
•   feel no remorse!
•   be shameless!
•   long live the open data revolution!
you

• half newbie, half “done some”
me

• not really a developer
• ..but code enough ASP (stop giggling)
to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk
we <3 data

• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream

 http://www.ucas.com/instit/i/h60.html




                                         http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
scraping

 • copy & paste, without having to copy &
 paste...
 • an inexact but really rather beautiful
 science




Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")

Call xmlhttp.Open("GET",url,False)
Call xmlhttp.send

ReturnedXML = xmlhttp.responsetext
scraping (cont)

• frowned on by purists...
• but really rather powerful
• http://hoard.it
extraction #1: Y!Pipes

•   find your data on page
•   view source
•   determine the delimeters
•   put it into Pipes
•   extract the output




                               originating page | output
extraction #2: Google Docs

• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/
table)
• use the importHTML() function (others for
feeds, xml, data, etc)
• dump out data as...CSV/XML/RSS/etc




                           originating page | output
extraction #3: dapper.net

• go to dapper.net/open
• identify several of the urls with the same
“shapes” that you want to scrape
• use the dapper dashboard to identify
content areas
• build the “dapp”
• pass in url’s of pages you want to extract
data from
• extract results from the output (xml,
flash, csv, etc)




                          originating page | output
extraction #4: YQL

•   view source on the page you want to grab
•   go to http://developer.yahoo.com/yql/console/
•   get your XPath hat on and build a query
•   grab the data from a RESTful query




      http://developer.yahoo.com/yql/console/?
      q=select%20*%20from%20html%20where%20url%3D
      %22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq
      %3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa
      %5B%40class%3D%22result%22%5D%27




                                   originating page | output
extraction #5: httrack

• grab a copy of httrack (or similar)from
  http://www.httrack.com/
• point it at the bit of the site you want,
make sure the filters are correct, and push
go...
• you now have a local copy of the site, to
munge as you see fit
extraction #6: hacked search

• get an API key from Yahoo!
• use it to search within a domain
• script a standard download script to pick
out each page and download it
• hack that mumma
• (variation on a theme: build a simple
spider...)
now you’ve got your data..

• once you’ve got your data, you usually
need to munge it...
munging #1: regex!

• I’m terrible at regex
• ([A-PR-UWYZ0-9][A-HK-Y0-9]
[AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}
[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
• but it’s incredibly powerful...




                                            output
munging #2: find/replace

• use whatever scripting language you work
best with
• (even Word...)
• you’ll find that replace double space,
replace weird characters, replace paragraph
marks are about the most common needs
munging #3: mail merge!

• for rapid builds of html, javascript or
xml
• have a source document (often extracted or
munged from other sites) in Excel
• you can use filters to effectively grab
the data you need
• build the merge in Word, using the
“directory” option
• copy and paste the result out
munging #4: html removal

• have a function handy that you can pass a
block of html
• it is handy to have a script where you can
define which particular tags to remove or
leave in place
munging #5: html tidy

• grab a copy of html tidy from
 http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe
or a component that you can pass data to in
your code
processing #1: Open Calais

• a service from Reuters for analysing
blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into
http://sws.clearforest.com/calaisviewer/




                                             output
processing #2: Yahoo! TE

• a webservice for grabbing tags/terms from
blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results..




                                          output
processing #3: geo!

• go to http://developer.yahoo.com/geo !
the ugly sisters

• Access
• Excel (!)
the last resorts

• FOI (frankie!)
• OCR (me)
the very last resort..

• re-type it...
• (or use Amazon Mechanical Turk)
...any more?

More Related Content

Viewers also liked

CLV e Mídia Programática
CLV e Mídia ProgramáticaCLV e Mídia Programática
CLV e Mídia Programática
Sociomantic Labs
 
Top Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to KnowTop Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to Know
InMobi
 
Calculating LTV Using Flurry
Calculating LTV Using FlurryCalculating LTV Using Flurry
Calculating LTV Using Flurry
Yaniv Nizan
 
Calculating LTV Using Google Analytics
Calculating LTV Using Google AnalyticsCalculating LTV Using Google Analytics
Calculating LTV Using Google Analytics
Yaniv Nizan
 
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert
 
Two Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a SpreadsheetTwo Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a Spreadsheet
Eric Seufert
 
Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)
Demac Media
 
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/201411 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
Rob Moffat
 
A step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime valueA step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime value
Geoff Fripp
 

Viewers also liked (9)

CLV e Mídia Programática
CLV e Mídia ProgramáticaCLV e Mídia Programática
CLV e Mídia Programática
 
Top Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to KnowTop Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to Know
 
Calculating LTV Using Flurry
Calculating LTV Using FlurryCalculating LTV Using Flurry
Calculating LTV Using Flurry
 
Calculating LTV Using Google Analytics
Calculating LTV Using Google AnalyticsCalculating LTV Using Google Analytics
Calculating LTV Using Google Analytics
 
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
 
Two Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a SpreadsheetTwo Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a Spreadsheet
 
Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)
 
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/201411 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
 
A step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime valueA step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime value
 

Similar to Scraping Scripting Hacking

The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
jasonhaddix
 
Scrapy
ScrapyScrapy
Learning to code
Learning to codeLearning to code
Learning to code
Sara-Jayne Terp
 
Google Hacking 101
Google Hacking 101Google Hacking 101
Google Hacking 101
Sais Abdelkrim
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
drgath
 
Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)
Krzysztof Kotowicz
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxMichael Hackstein
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Esteve Castells
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
ComputerScienceJunct
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
Leo Loobeek
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internetdrgath
 
Jinx - Malware 2.0
Jinx - Malware 2.0Jinx - Malware 2.0
Jinx - Malware 2.0
Itzik Kotler
 
Data-Analytics using python (Module 4).pptx
Data-Analytics using python (Module 4).pptxData-Analytics using python (Module 4).pptx
Data-Analytics using python (Module 4).pptx
DRSHk10
 
[2010]我有一个梦想
[2010]我有一个梦想[2010]我有一个梦想
[2010]我有一个梦想
Twinsen Liang
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
bodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
Sara-Jayne Terp
 
Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Eric D.
 
Basic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 sessionBasic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 session
Rob Dunn
 
On the Edge Systems Administration with Golang
On the Edge Systems Administration with GolangOn the Edge Systems Administration with Golang
On the Edge Systems Administration with Golang
Chris McEniry
 

Similar to Scraping Scripting Hacking (20)

The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Scrapy
ScrapyScrapy
Scrapy
 
Learning to code
Learning to codeLearning to code
Learning to code
 
Google Hacking 101
Google Hacking 101Google Hacking 101
Google Hacking 101
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
 
Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internet
 
Jinx - Malware 2.0
Jinx - Malware 2.0Jinx - Malware 2.0
Jinx - Malware 2.0
 
Data-Analytics using python (Module 4).pptx
Data-Analytics using python (Module 4).pptxData-Analytics using python (Module 4).pptx
Data-Analytics using python (Module 4).pptx
 
[2010]我有一个梦想
[2010]我有一个梦想[2010]我有一个梦想
[2010]我有一个梦想
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Splunk bsides
Splunk bsidesSplunk bsides
Splunk bsides
 
Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008
 
Basic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 sessionBasic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 session
 
On the Edge Systems Administration with Golang
On the Edge Systems Administration with GolangOn the Edge Systems Administration with Golang
On the Edge Systems Administration with Golang
 

More from Mike Ellis

5 digital habits of highly effective museums
5 digital habits of highly effective museums5 digital habits of highly effective museums
5 digital habits of highly effective museums
Mike Ellis
 
How to stop freelance from killing you
How to stop freelance from killing youHow to stop freelance from killing you
How to stop freelance from killing you
Mike Ellis
 
Getting collections online
Getting collections onlineGetting collections online
Getting collections onlineMike Ellis
 
Why Wordpress is better than your cms
Why Wordpress is better than your cmsWhy Wordpress is better than your cms
Why Wordpress is better than your cms
Mike Ellis
 
Forget the objects, tell the stories
Forget the objects, tell the storiesForget the objects, tell the stories
Forget the objects, tell the stories
Mike Ellis
 
Bath Digital general introduction
Bath Digital general introductionBath Digital general introduction
Bath Digital general introductionMike Ellis
 
Stop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tipsStop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tips
Mike Ellis
 
Bathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeistBathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeist
Mike Ellis
 
Strategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things upStrategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things up
Mike Ellis
 
If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0) If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0)
Mike Ellis
 
Mobile: the next frontier
Mobile: the next frontierMobile: the next frontier
Mobile: the next frontier
Mike Ellis
 
Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?
Mike Ellis
 
The Intertubes Everywhere
The Intertubes EverywhereThe Intertubes Everywhere
The Intertubes Everywhere
Mike Ellis
 
Bathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The YearBathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The Year
Mike Ellis
 
The Benefits Of Doing Things Differently
The Benefits Of Doing Things DifferentlyThe Benefits Of Doing Things Differently
The Benefits Of Doing Things Differently
Mike Ellis
 
Collaboration 2.0
Collaboration 2.0Collaboration 2.0
Collaboration 2.0
Mike Ellis
 
Getting people together
Getting people togetherGetting people together
Getting people together
Mike Ellis
 
3 minutes, one technology: the piano
3 minutes, one technology: the piano3 minutes, one technology: the piano
3 minutes, one technology: the piano
Mike Ellis
 
Don't Think Websites, think data
Don't Think Websites, think dataDon't Think Websites, think data
Don't Think Websites, think data
Mike Ellis
 
Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"
Mike Ellis
 

More from Mike Ellis (20)

5 digital habits of highly effective museums
5 digital habits of highly effective museums5 digital habits of highly effective museums
5 digital habits of highly effective museums
 
How to stop freelance from killing you
How to stop freelance from killing youHow to stop freelance from killing you
How to stop freelance from killing you
 
Getting collections online
Getting collections onlineGetting collections online
Getting collections online
 
Why Wordpress is better than your cms
Why Wordpress is better than your cmsWhy Wordpress is better than your cms
Why Wordpress is better than your cms
 
Forget the objects, tell the stories
Forget the objects, tell the storiesForget the objects, tell the stories
Forget the objects, tell the stories
 
Bath Digital general introduction
Bath Digital general introductionBath Digital general introduction
Bath Digital general introduction
 
Stop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tipsStop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tips
 
Bathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeistBathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeist
 
Strategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things upStrategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things up
 
If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0) If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0)
 
Mobile: the next frontier
Mobile: the next frontierMobile: the next frontier
Mobile: the next frontier
 
Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?
 
The Intertubes Everywhere
The Intertubes EverywhereThe Intertubes Everywhere
The Intertubes Everywhere
 
Bathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The YearBathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The Year
 
The Benefits Of Doing Things Differently
The Benefits Of Doing Things DifferentlyThe Benefits Of Doing Things Differently
The Benefits Of Doing Things Differently
 
Collaboration 2.0
Collaboration 2.0Collaboration 2.0
Collaboration 2.0
 
Getting people together
Getting people togetherGetting people together
Getting people together
 
3 minutes, one technology: the piano
3 minutes, one technology: the piano3 minutes, one technology: the piano
3 minutes, one technology: the piano
 
Don't Think Websites, think data
Don't Think Websites, think dataDon't Think Websites, think data
Don't Think Websites, think data
 
Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"
 

Recently uploaded

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Scraping Scripting Hacking

  • 1. scraping, http://www.flickr.com/photos/juan23/82888194/ scripting and hacking your way to API-less data [AKA: if you don’t have data feeds, we’ll get it anyway]
  • 2. overview • “getting data out” • non-exhaustive (and rapid!) • slightly random • live examples (hopefully) • mainly non-technical(ish) • mainly non-illegal. I think.
  • 3. anything goes • have no fear! • feel no remorse! • be shameless! • long live the open data revolution!
  • 4. you • half newbie, half “done some”
  • 5. me • not really a developer • ..but code enough ASP (stop giggling) to do what I want to do • slides will be at slideshare.net/dmje • www.electronicmuseum.org.uk • mike.ellis@eduserv.org.uk
  • 6. we <3 data • we want programmatic access... • ...but sites are often lacking • ...and APIs are usually a pipe dream http://www.ucas.com/instit/i/h60.html http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
  • 7. scraping • copy & paste, without having to copy & paste... • an inexact but really rather beautiful science Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0") Call xmlhttp.Open("GET",url,False) Call xmlhttp.send ReturnedXML = xmlhttp.responsetext
  • 8. scraping (cont) • frowned on by purists... • but really rather powerful • http://hoard.it
  • 9. extraction #1: Y!Pipes • find your data on page • view source • determine the delimeters • put it into Pipes • extract the output originating page | output
  • 10. extraction #2: Google Docs • create a new google spreadsheet • find the URL of the data you want • identify how it is encapsulated (list/ table) • use the importHTML() function (others for feeds, xml, data, etc) • dump out data as...CSV/XML/RSS/etc originating page | output
  • 11. extraction #3: dapper.net • go to dapper.net/open • identify several of the urls with the same “shapes” that you want to scrape • use the dapper dashboard to identify content areas • build the “dapp” • pass in url’s of pages you want to extract data from • extract results from the output (xml, flash, csv, etc) originating page | output
  • 12. extraction #4: YQL • view source on the page you want to grab • go to http://developer.yahoo.com/yql/console/ • get your XPath hat on and build a query • grab the data from a RESTful query http://developer.yahoo.com/yql/console/? q=select%20*%20from%20html%20where%20url%3D %22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq %3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa %5B%40class%3D%22result%22%5D%27 originating page | output
  • 13. extraction #5: httrack • grab a copy of httrack (or similar)from http://www.httrack.com/ • point it at the bit of the site you want, make sure the filters are correct, and push go... • you now have a local copy of the site, to munge as you see fit
  • 14. extraction #6: hacked search • get an API key from Yahoo! • use it to search within a domain • script a standard download script to pick out each page and download it • hack that mumma • (variation on a theme: build a simple spider...)
  • 15. now you’ve got your data.. • once you’ve got your data, you usually need to munge it...
  • 16. munging #1: regex! • I’m terrible at regex • ([A-PR-UWYZ0-9][A-HK-Y0-9] [AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2} [0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA) • but it’s incredibly powerful... output
  • 17. munging #2: find/replace • use whatever scripting language you work best with • (even Word...) • you’ll find that replace double space, replace weird characters, replace paragraph marks are about the most common needs
  • 18. munging #3: mail merge! • for rapid builds of html, javascript or xml • have a source document (often extracted or munged from other sites) in Excel • you can use filters to effectively grab the data you need • build the merge in Word, using the “directory” option • copy and paste the result out
  • 19. munging #4: html removal • have a function handy that you can pass a block of html • it is handy to have a script where you can define which particular tags to remove or leave in place
  • 20. munging #5: html tidy • grab a copy of html tidy from http://tidy.sourceforge.net/ • tidy is available as a downloadable .exe or a component that you can pass data to in your code
  • 21. processing #1: Open Calais • a service from Reuters for analysing blocks of text for semantic “meaning” • get an API key from Open Calais • send data via a POST to the REST service • retrieve results from the RDF • OR...just paste your text into http://sws.clearforest.com/calaisviewer/ output
  • 22. processing #2: Yahoo! TE • a webservice for grabbing tags/terms from blocks of text • sign up for a Yahoo! API key • pass your block of text using POST • grab the results.. output
  • 23. processing #3: geo! • go to http://developer.yahoo.com/geo !
  • 24. the ugly sisters • Access • Excel (!)
  • 25. the last resorts • FOI (frankie!) • OCR (me)
  • 26. the very last resort.. • re-type it... • (or use Amazon Mechanical Turk)