Web Crawling Modeling with Scrapy Models #TDC2014

5,193 views

Published on

Modeling web crawlers with Scrapy and ScrapyModels

Published in: Internet

Transcript

  • 1. Scrapy Models
  • 2. Web Crawling - Capture unstructured content from the web: HTML, XML(?), plain text… - Parse, validate, and store - Automate the process (a minimal sketch of the whole pipeline follows)
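    Slide 2's three steps in one minimal sketch, assuming only the requests library; the URL, output file name, and the naive title split are placeholders standing in for the real parsers introduced on the following slides:

        import json
        import requests

        # Capture: download unstructured HTML
        html = requests.get("http://example.com").text
        # Parse: naive placeholder; the libraries on the next slides do this properly
        title = html.split("<title>")[1].split("</title>")[0]
        # Validate and store
        if title:
            with open("data.json", "w") as f:
                json.dump({"title": title}, f)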
  • 3. {'links': [u'http://www.python.org/~guido/',
                  u'http://neopythonic.blogspot.com/',
                  u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
                  u'http://python-history.blogspot.com/',
                  u'http://www.python.org/doc/essays/cp4e.html',
                  u'http://www.twit.tv/floss11',
                  u'http://www.computerworld.com.au/index.php/id;66665771',
                  u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
                  u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
        'name': u'Guido van Rossum',
        'nationality': u'Dutch',
        'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
        'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
  • 4. PYTHON ROCKS!!! - LXML - HTMLParser - Beautiful Soup - Scrapy - XMLToDict - Requests
  • 5. Beautiful Soup

        import requests
        from bs4 import BeautifulSoup

        # fetch the page body (the slide's example URL)
        response = requests.get("http://schblaums.com")
        soup = BeautifulSoup(response.text, "html.parser")
        user_pictures = soup.find_all("img")  # -> [<Tag img ...>, ...]
        print(user_pictures[0])         # <img src="/user/picture.jpg" />
        print(user_pictures[0]["src"])  # /user/picture.jpg
  • 6. LXML

        from lxml import etree

        # parse an HTML string (use the HTML parser; etree.parse expects a file or URL)
        tree = etree.fromstring(html, etree.HTMLParser())
        user_pictures = tree.xpath("//img")  # -> [<Element img>, <Element img>, ...]
  • 7. CSS jQuery
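    Slide 7 is a nod to CSS selectors, the jQuery-style alternative to XPath. A minimal sketch of the same <img> lookup via CSS, assuming lxml.html with the cssselect package installed (the HTML snippet here is just an inline stand-in for a fetched page):

        from lxml import html

        tree = html.fromstring('<div><img src="/user/picture.jpg"/></div>')
        images = tree.cssselect("img")  # jQuery-style CSS selector, equivalent to tree.xpath("//img")
        print(images[0].get("src"))     # /user/picture.jpg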
  • 8. Scrapy - Web crawling framework - Automates the process - Validation - Data mapping - Selectors with XPath or CSS support
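    A minimal sketch of Scrapy's two selector styles on a small HTML fragment, assuming only the scrapy package (the fragment is the one used later on slide 11):

        from scrapy.selector import Selector

        sel = Selector(text='<span id="person">Bruno Rocha <a href="http://brunorocha.org">website</a></span>')
        sel.css('span#person::text').extract()               # [u'Bruno Rocha ']
        sel.xpath('//span[@id="person"]/a/@href').extract()  # [u'http://brunorocha.org']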
  • 9. Scrapy Model

        from mongoengine|django|* import Model

        class Person(Model):
            name = StringField()
            links = ListField()
            picture = ImageField()
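    The slide's import line is pseudocode for "any ORM model class". A concrete version, as a minimal sketch assuming MongoEngine (whose document base class is Document rather than Model; the field names mirror the slide):

        import mongoengine

        class Person(mongoengine.Document):
            name = mongoengine.StringField()
            links = mongoengine.ListField(mongoengine.URLField())
            picture = mongoengine.ImageField()  # image storage requires Pillow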
  • 10. {'links': [u'http://www.python.org/~guido/',
                   u'http://neopythonic.blogspot.com/',
                   u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
                   u'http://python-history.blogspot.com/',
                   u'http://www.python.org/doc/essays/cp4e.html',
                   u'http://www.twit.tv/floss11',
                   u'http://www.computerworld.com.au/index.php/id;66665771',
                   u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
                   u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
         'name': u'Guido van Rossum',
         'nationality': u'Dutch',
         'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
         'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
  • 11. What is scrapy_model? It is just a helper for building scrapers with the Scrapy Selectors, letting you select elements by CSS or by XPath, structure your scraper via Models (just like an ORM model), and plug into an ORM model via the populate method. Import BaseFetcherModel and CSSField or XPathField (you can use both):

        from scrapy_model import BaseFetcherModel, CSSField

    Go to a webpage you want to scrape and use Chrome dev tools or Firebug to figure out the CSS paths; then, supposing you want to get the following fragment from some page:

        <span id="person">Bruno Rocha <a href="http://brunorocha.org">website</a></span>

        class MyFetcher(BaseFetcherModel):
            name = CSSField('span#person')
            website = CSSField('span#person a')
            # XPathField('//xpath_selector_here')
  • 12. Multiple queries in a single field. You can use multiple queries for a single field:

        name = XPathField(
            ['//*[@id="8"]/div[2]/div/div[2]/div[2]/ul',
             '//*[@id="8"]/div[2]/div/div[3]/div[2]/ul']
        )

    In that case, the parser tries the first query and returns as soon as it finds a match; otherwise it tries the subsequent queries until it finds something, or returns an empty selector.
  • 13. Finding the best match with a query validator. If you want to run multiple queries and also validate the best match, you can pass a validator function, which takes the scrapy selector and should return a boolean. For example, imagine you have the "name" field defined above and you want to validate each query, ensuring the result has an 'li' with the text "Schblaums" in it:

        def has_schblaums(selector):
            for li in selector.css('li'):  # take each <li> inside the ul selector
                li_text = li.css('::text').extract()  # extract only the text
                if "Schblaums" in li_text:  # check whether "Schblaums" is there
                    return True   # this selector is valid!
            return False  # invalid query, take the next one or the default value

        class Fetcher(BaseFetcherModel):
            name = XPathField(
                ['//*[@id="8"]/div[2]/div/div[2]/div[2]/ul',
                 '//*[@id="8"]/div[2]/div/div[3]/div[2]/ul'],
                query_validator=has_schblaums,
                default="undefined_name"
            )
  • 14. Every method named parse_<field> will run after all the fields are fetched, one per field:

        def parse_name(self, selector):
            # here selector is the scrapy selector for 'span#person'
            name = selector.css('::text').extract()
            return name

        def parse_website(self, selector):
            # here selector is the scrapy selector for 'span#person a'
            website_url = selector.css('::attr(href)').extract()
            return website_url

    Once defined, you need to run the scraper:

        fetcher = MyFetcher(url='http://.....')  # optionally, cached_fetch=True caches requests on Redis
        fetcher.parse()
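    Slide 11 mentions that a fetcher is pluggable into an ORM model via the populate method. A minimal sketch of that hand-off, assuming the Person model from slide 9; the exact populate signature used here (instance plus an optional fields whitelist) is an assumption, so check the scrapy_model README:

        person = Person()
        # hand the fetched data over to the ORM instance
        # (fields list assumed to whitelist which attributes get copied)
        fetcher.populate(person, fields=["name", "website"])
        person.save()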
  • 15. https://github.com/rochacbruno/scrapy_model
