SlideShare a Scribd company logo
1 of 22
Download to read offline
Scrapy
Patrick O’Brien | @obdit
DataPhilly | 20131118 | Monetate
Steps of data science
●
●
●
●
●

Obtain
Scrub
Explore
Model
iNterpret
Steps of data science
●
●
●
●
●

Obtain
Scrub
Explore
Model
iNterpret

Scrapy
About Scrapy
●
●
●
●

Framework for collecting data
Open source - 100% python
Simple and well documented
OSX, Linux, Windows, BSD
Some of the features
● Built-in selectors
● Generating feed output
○ Format: json, csv, xml
○ Storage: local, FTP, S3, stdout

●
●
●
●

Encoding and autodetection
Stats collection
Control via a web service
Handle cookies, auth, robots.txt, user-agent
Scrapy Architecture
Data flow
1.
2.
3.
4.
5.
6.
7.

Engine opens, locates Spider, schedule first url as a Request
Scheduler sends url to the Engine, which sends it to Downloader
Downloader sends completed page as a Response through the
middleware to the engine
Engine sends Response to the Spider through middleware
Spiders sends Items and new Requests to the Engine
Engine sends Items to the Item Pipeline and Requests to the Scheduler
GOTO 2
Parts of Scrapy
●
●
●
●
●
●

Items
Spider
Link Extractors
Selectors
Request
Responses
Items
● Main container of structured information
● dict-like objects
from scrapy.item import Item, Field
class Product(Item):
name = Field()
price = Field()
stock = Field()
last_updated = Field(serializer=str)
Items
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Spiders
● Define how to move around a site
○ which links to follow
○ how to extract data

● Cycle
○ Initial request and callback
○ Store parsed content
○ Subsequent requests and callbacks
Generic Spiders
●
●
●
●
●

BaseSpider
CrawlSpider
XMLFeedSpider
CSVFeedSpider
SitemapSpider
BaseSpider
● Every other spider inherits from BaseSpider
● Two jobs
○ Request `start_urls`
○ Callback `parse` on resulting response
BaseSpider
...
class MySpider(BaseSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = [

● Send Requests example.
com/[1:3].html

'http://www.example.com/1.html',
'http://www.example.com/2.html',
'http://www.example.com/3.html',
]

● Yield title Item

def parse(self, response):
sel = Selector(response)
for h3 in sel.xpath('//h3').extract():
yield MyItem(title=h3)

for url in sel.xpath('//a/@href').extract():
yield Request(url, callback=self.parse)

● Yield new Request
CrawlSpider
● Provide a set of rules on what links to follow
○ `link_extractor`
○ `call_back`
rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
# Extract links matching 'item.php' and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'),
)
Link Extractors
● Extract links from Response objects
● Parameters include:
○
○
○
○
○
○

allow | deny
allow_domains | deny_domains
deny_extensions
restrict_xpaths
tags
attrs
Selectors
● Mechanisms for extracting data from HTML
● Built over the lxml library
● Two methods
○ XPath: sel.xpath('//a[contains(@href,

"image")]/@href' ).

extract()

○

CSS: sel.css('a[href*=image]::attr(href)' ).extract()

● Response object is called into Selector
○

sel = Selector(response)
Request
● Generated in Spider, sent to Downloader
● Represent an HTTP request
● FormRequest subclass performs HTTP
POST
○ useful to simulate user login
Response
● Comes from Downloader and sent to Spider
● Represents HTTP response
● Subclasses
○ TextResponse
○ HTMLResponse
○ XmlResponse
Advanced Scrapy
● Scrapyd
○ application to deploy and run Scrapy spiders
○ deploy projects and control with JSON API

● Signals
○ notify when events occur
○ hook into Signals API for advance tuning

● Extensions
○ Custom functionality loaded at Scrapy startup
More information
●
●
●
●

http://doc.scrapy.org/
https://twitter.com/ScrapyProject
https://github.com/scrapy
http://scrapyd.readthedocs.org
Demo

More Related Content

What's hot

How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
N hidden gems you didn't know hippo delivery tier and hippo (forge) could give
N hidden gems you didn't know hippo delivery tier and hippo (forge) could giveN hidden gems you didn't know hippo delivery tier and hippo (forge) could give
N hidden gems you didn't know hippo delivery tier and hippo (forge) could giveWoonsan Ko
 
Approach to find critical vulnerabilities
Approach to find critical vulnerabilitiesApproach to find critical vulnerabilities
Approach to find critical vulnerabilitiesAshish Kunwar
 
Application Logging With The ELK Stack
Application Logging With The ELK StackApplication Logging With The ELK Stack
Application Logging With The ELK Stackbenwaine
 
LogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeLogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeJames Turnbull
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
 
Quicli - From zero to a full CLI application in a few lines of Rust
Quicli - From zero to a full CLI application in a few lines of RustQuicli - From zero to a full CLI application in a few lines of Rust
Quicli - From zero to a full CLI application in a few lines of RustDamien Castelltort
 
Do something in 5 with gas 8-copy between databases
Do something in 5 with gas 8-copy between databasesDo something in 5 with gas 8-copy between databases
Do something in 5 with gas 8-copy between databasesBruce McPherson
 
Python Development (MongoSF)
Python Development (MongoSF)Python Development (MongoSF)
Python Development (MongoSF)Mike Dirolf
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineAndy McKay
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
Do something in 5 with gas 2-graduate to a database
Do something in 5 with gas 2-graduate to a databaseDo something in 5 with gas 2-graduate to a database
Do something in 5 with gas 2-graduate to a databaseBruce McPherson
 

What's hot (20)

How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
Web Scrapping with Python
Web Scrapping with PythonWeb Scrapping with Python
Web Scrapping with Python
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
N hidden gems you didn't know hippo delivery tier and hippo (forge) could give
N hidden gems you didn't know hippo delivery tier and hippo (forge) could giveN hidden gems you didn't know hippo delivery tier and hippo (forge) could give
N hidden gems you didn't know hippo delivery tier and hippo (forge) could give
 
Webscraping with asyncio
Webscraping with asyncioWebscraping with asyncio
Webscraping with asyncio
 
Fun with Python
Fun with PythonFun with Python
Fun with Python
 
Approach to find critical vulnerabilities
Approach to find critical vulnerabilitiesApproach to find critical vulnerabilities
Approach to find critical vulnerabilities
 
Application Logging With The ELK Stack
Application Logging With The ELK StackApplication Logging With The ELK Stack
Application Logging With The ELK Stack
 
LogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesomeLogStash - Yes, logging can be awesome
LogStash - Yes, logging can be awesome
 
Using Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibanaUsing Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibana
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as database
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
 
CouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text SearchCouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text Search
 
Quicli - From zero to a full CLI application in a few lines of Rust
Quicli - From zero to a full CLI application in a few lines of RustQuicli - From zero to a full CLI application in a few lines of Rust
Quicli - From zero to a full CLI application in a few lines of Rust
 
Do something in 5 with gas 8-copy between databases
Do something in 5 with gas 8-copy between databasesDo something in 5 with gas 8-copy between databases
Do something in 5 with gas 8-copy between databases
 
2015 555 kharchenko_ppt
2015 555 kharchenko_ppt2015 555 kharchenko_ppt
2015 555 kharchenko_ppt
 
Python Development (MongoSF)
Python Development (MongoSF)Python Development (MongoSF)
Python Development (MongoSF)
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
Do something in 5 with gas 2-graduate to a database
Do something in 5 with gas 2-graduate to a databaseDo something in 5 with gas 2-graduate to a database
Do something in 5 with gas 2-graduate to a database
 

Viewers also liked

Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 

Viewers also liked (6)

Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Hadoop Now, Next & Beyond
Hadoop Now, Next & BeyondHadoop Now, Next & Beyond
Hadoop Now, Next & Beyond
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hive sq lfor-hadoop
Hive sq lfor-hadoopHive sq lfor-hadoop
Hive sq lfor-hadoop
 
SQL in Hadoop
SQL in HadoopSQL in Hadoop
SQL in Hadoop
 

Similar to Scrapy talk at DataPhilly

Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabszekeLabs Technologies
 
GDG İstanbul Şubat Etkinliği - Sunum
GDG İstanbul Şubat Etkinliği - SunumGDG İstanbul Şubat Etkinliği - Sunum
GDG İstanbul Şubat Etkinliği - SunumCüneyt Yeşilkaya
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
SF Grails - Ratpack - Compact Groovy Webapps - James Williams
SF Grails - Ratpack - Compact Groovy Webapps - James WilliamsSF Grails - Ratpack - Compact Groovy Webapps - James Williams
SF Grails - Ratpack - Compact Groovy Webapps - James WilliamsPhilip Stehlik
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
EuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingEuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingAlessandro Molina
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheusBob Cotton
 
Maximizer 2018 API training
Maximizer 2018 API trainingMaximizer 2018 API training
Maximizer 2018 API trainingMurylo Batista
 
Sphinx + robot framework = documentation as result of functional testing
Sphinx + robot framework = documentation as result of functional testingSphinx + robot framework = documentation as result of functional testing
Sphinx + robot framework = documentation as result of functional testingplewicki
 
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Danny Abukalam
 
Akash rajguru project report sem v
Akash rajguru project report sem vAkash rajguru project report sem v
Akash rajguru project report sem vAkash Rajguru
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUPRonald Hsu
 
Google app engine - Soft Uni 19.06.2014
Google app engine - Soft Uni 19.06.2014Google app engine - Soft Uni 19.06.2014
Google app engine - Soft Uni 19.06.2014Dimitar Danailov
 
Single page apps_with_cf_and_angular[1]
Single page apps_with_cf_and_angular[1]Single page apps_with_cf_and_angular[1]
Single page apps_with_cf_and_angular[1]ColdFusionConference
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 

Similar to Scrapy talk at DataPhilly (20)

Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabs
 
Revealing ALLSTOCKER
Revealing ALLSTOCKERRevealing ALLSTOCKER
Revealing ALLSTOCKER
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
GDG İstanbul Şubat Etkinliği - Sunum
GDG İstanbul Şubat Etkinliği - SunumGDG İstanbul Şubat Etkinliği - Sunum
GDG İstanbul Şubat Etkinliği - Sunum
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
SF Grails - Ratpack - Compact Groovy Webapps - James Williams
SF Grails - Ratpack - Compact Groovy Webapps - James WilliamsSF Grails - Ratpack - Compact Groovy Webapps - James Williams
SF Grails - Ratpack - Compact Groovy Webapps - James Williams
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
EuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears TrainingEuroPython 2013 - Python3 TurboGears Training
EuroPython 2013 - Python3 TurboGears Training
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
Maximizer 2018 API training
Maximizer 2018 API trainingMaximizer 2018 API training
Maximizer 2018 API training
 
Mini Curso de Django
Mini Curso de DjangoMini Curso de Django
Mini Curso de Django
 
Sphinx + robot framework = documentation as result of functional testing
Sphinx + robot framework = documentation as result of functional testingSphinx + robot framework = documentation as result of functional testing
Sphinx + robot framework = documentation as result of functional testing
 
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
Matt Jarvis - Unravelling Logs: Log Processing with Logstash and Riemann
 
Catalyst MVC
Catalyst MVCCatalyst MVC
Catalyst MVC
 
Akash rajguru project report sem v
Akash rajguru project report sem vAkash rajguru project report sem v
Akash rajguru project report sem v
 
202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP202107 - Orion introduction - COSCUP
202107 - Orion introduction - COSCUP
 
Google app engine - Soft Uni 19.06.2014
Google app engine - Soft Uni 19.06.2014Google app engine - Soft Uni 19.06.2014
Google app engine - Soft Uni 19.06.2014
 
Single page apps_with_cf_and_angular[1]
Single page apps_with_cf_and_angular[1]Single page apps_with_cf_and_angular[1]
Single page apps_with_cf_and_angular[1]
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Scrapy talk at DataPhilly

  • 1. Scrapy Patrick O’Brien | @obdit DataPhilly | 20131118 | Monetate
  • 2. Steps of data science ● ● ● ● ● Obtain Scrub Explore Model iNterpret
  • 3. Steps of data science ● ● ● ● ● Obtain Scrub Explore Model iNterpret Scrapy
  • 4. About Scrapy ● ● ● ● Framework for collecting data Open source - 100% python Simple and well documented OSX, Linux, Windows, BSD
  • 5. Some of the features ● Built-in selectors ● Generating feed output ○ Format: json, csv, xml ○ Storage: local, FTP, S3, stdout ● ● ● ● Encoding and autodetection Stats collection Control via a web service Handle cookies, auth, robots.txt, user-agent
  • 7. Data flow 1. 2. 3. 4. 5. 6. 7. Engine opens, locates Spider, schedule first url as a Request Scheduler sends url to the Engine, which sends it to Downloader Downloader sends completed page as a Response through the middleware to the engine Engine sends Response to the Spider through middleware Spiders sends Items and new Requests to the Engine Engine sends Items to the Item Pipeline and Requests to the Scheduler GOTO 2
  • 8. Parts of Scrapy ● ● ● ● ● ● Items Spider Link Extractors Selectors Request Responses
  • 9. Items ● Main container of structured information ● dict-like objects from scrapy.item import Item, Field class Product(Item): name = Field() price = Field() stock = Field() last_updated = Field(serializer=str)
  • 10. Items >>> product = Product(name='Desktop PC', price=1000) >>> print product Product(name='Desktop PC', price=1000) >>> product['name'] Desktop PC >>> product.get('name') Desktop PC >>> product.keys() ['price', 'name'] >>> product.items() [('price', 1000), ('name', 'Desktop PC')]
  • 11. Spiders ● Define how to move around a site ○ which links to follow ○ how to extract data ● Cycle ○ Initial request and callback ○ Store parsed content ○ Subsequent requests and callbacks
  • 13. BaseSpider ● Every other spider inherits from BaseSpider ● Two jobs ○ Request `start_urls` ○ Callback `parse` on resulting response
  • 14. BaseSpider ... class MySpider(BaseSpider): name = 'example.com' allowed_domains = ['example.com'] start_urls = [ ● Send Requests example. com/[1:3].html 'http://www.example.com/1.html', 'http://www.example.com/2.html', 'http://www.example.com/3.html', ] ● Yield title Item def parse(self, response): sel = Selector(response) for h3 in sel.xpath('//h3').extract(): yield MyItem(title=h3) for url in sel.xpath('//a/@href').extract(): yield Request(url, callback=self.parse) ● Yield new Request
  • 15. CrawlSpider ● Provide a set of rules on what links to follow ○ `link_extractor` ○ `call_back` rules = ( # Extract links matching 'category.php' (but not matching 'subsection.php') # and follow links from them (since no callback means follow=True by default). Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))), # Extract links matching 'item.php' and parse them with the spider's method parse_item Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'), )
  • 16. Link Extractors ● Extract links from Response objects ● Parameters include: ○ ○ ○ ○ ○ ○ allow | deny allow_domains | deny_domains deny_extensions restrict_xpaths tags attrs
  • 17. Selectors ● Mechanisms for extracting data from HTML ● Built over the lxml library ● Two methods ○ XPath: sel.xpath('//a[contains(@href, "image")]/@href' ). extract() ○ CSS: sel.css('a[href*=image]::attr(href)' ).extract() ● Response object is called into Selector ○ sel = Selector(response)
  • 18. Request ● Generated in Spider, sent to Downloader ● Represent an HTTP request ● FormRequest subclass performs HTTP POST ○ useful to simulate user login
  • 19. Response ● Comes from Downloader and sent to Spider ● Represents HTTP response ● Subclasses ○ TextResponse ○ HTMLResponse ○ XmlResponse
  • 20. Advanced Scrapy ● Scrapyd ○ application to deploy and run Scrapy spiders ○ deploy projects and control with JSON API ● Signals ○ notify when events occur ○ hook into Signals API for advance tuning ● Extensions ○ Custom functionality loaded at Scrapy startup
  • 22. Demo