Published in: Technology


  1. SCRAPY FOR DUMMIES. Chandler Huang, previa [at]
  2. Related Resource
     - Web connection: urllib2, httplib2, Requests
     - Screen scraping:
       - lxml: an XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree. lxml itself is not part of the Python standard library.
       - Beautiful Soup: provides a few simple methods for navigating, searching, and modifying a parse tree. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, sits on top of popular Python parsers such as lxml and html5lib, and deals with bad markup reasonably well. One drawback: it's slow.
       - Mechanize: programmatic web browsing
  3. Why Scrapy?
     - Portable, open-source, 100% Python: Scrapy is written entirely in Python and runs on Linux, Windows, Mac, and BSD. It currently works only with Python 2.6 and 2.7.
     - Simple: Scrapy was designed with simplicity in mind, providing the features you need without getting in your way.
     - Productive: just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you.
     - Extensible: Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core.
  4. - Batteries included: Scrapy comes with lots of functionality built in.
     - Well-documented and well-tested: Scrapy is extensively documented and has a comprehensive test suite with very good code coverage.
     - Good community and commercial support: the scrapy-users forum, #scrapy @ freenode, and companies such as Parsely, Direct Employers Foundation, and Scrapinghub.
  5. Architecture overview
  6. Project layout
     - scrapy.cfg: the project configuration file
     - tutorial/: the project's Python module; you'll later import your code from here
     - tutorial/ the project's items file
     - tutorial/ the project's pipelines file
     - tutorial/ the project's settings file
     - tutorial/spiders/: a directory where you'll later put your spiders
  7. Basic SOP
     1. Define item: decide what to extract
     2. Define spider: decide the crawling strategy
     3. Define parse function: find patterns to extract data
     4. Pipeline: define post-processing
  8. Live Demo: Example
  9. Spider
     - Spiders are classes which define how to perform the crawl and how to extract structured data from their pages.
     - Built-in spider: BaseSpider
       - The simplest spider, and the one from which every other spider must inherit.
       - start_requests() generates Requests for the URLs specified in start_urls.
       - parse() is the default callback for those Requests.
       - parse() processes the response and returns either Item objects, Request objects, or an iterable of both.
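The callback contract on this slide can be sketched without Scrapy itself. The following is a minimal stand-in, not Scrapy's implementation: `Request`, `PAGES`, and `crawl` are hypothetical names, pages come from an in-memory dict rather than HTTP responses, and items are plain dicts. It only illustrates how a `parse()` callback may yield both items and follow-up requests.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a request: a URL plus the callback that handles its response.
Request = namedtuple("Request", ["url", "callback"])

# Hypothetical in-memory "site": url -> (page title, outgoing links).
PAGES = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", []),
}


def parse(url, response):
    """Default callback: yield one item for the page, then one request per link."""
    title, links = response
    yield {"url": url, "title": title}  # an item (a plain dict in this sketch)
    for link in links:
        yield Request(link, parse)  # a follow-up request


def crawl(start_urls):
    """Toy engine loop: collect items, send new requests back to the queue."""
    queue = deque(Request(url, parse) for url in start_urls)
    seen = set(start_urls)
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, PAGES[request.url]):
            if isinstance(result, Request):
                if result.url not in seen:  # skip URLs that were already scheduled
                    seen.add(result.url)
                    queue.append(result)
            else:
                items.append(result)
    return items


items = crawl(["/"])
```

In a real spider the same shape would appear as a `parse(self, response)` method on the spider class; the loop here only plays the role of the engine.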
  10. Spider: CrawlSpider
     - This is the most commonly used spider for crawling regular websites.
     - It provides a convenient mechanism for following links by defining a set of rules.
     - rules: a list of one (or more) Rule objects. Each Rule defines a certain behavior for crawling the site.
     - BaseSpider: Customize BFS; CrawlSpider: DFS
     - Other built-in spiders: XMLFeedSpider, CSVFeedSpider, SitemapSpider
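How a list of rules decides what happens to each discovered link can be sketched roughly as follows. This is a standard-library stand-in, not the CrawlSpider API: the `Rule` tuple, its fields, the URL patterns, and the callback name `parse_item` are all hypothetical, and the sketch assumes that the first matching rule wins.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a crawl rule: which links it matches, which callback
# handles the matched pages, and whether links on those pages are followed.
Rule = namedtuple("Rule", ["allow", "callback", "follow"])

RULES = [
    Rule(allow=r"/item/\d+$", callback="parse_item", follow=False),
    Rule(allow=r"/category/", callback=None, follow=True),
]


def match_rule(url, rules):
    """Return the first rule whose pattern matches the URL, or None."""
    for rule in rules:
        if re.search(rule.allow, url):
            return rule
    return None


def plan(links, rules):
    """Map each link to (url, callback, follow); links matching no rule are dropped."""
    planned = []
    for url in links:
        rule = match_rule(url, rules)
        if rule is not None:
            planned.append((url, rule.callback, rule.follow))
    return planned


links = ["/category/books", "/item/42", "/about", "/category/books/item/7"]
planned = plan(links, RULES)
```

Category pages are only followed, item pages are handed to a callback, and `/about` matches nothing, so it is never requested.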
  11. Parse(): Selector
     - Scrapy comes with its own mechanism for extracting data.
     - XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
     - Scrapy Selectors are built over the libxml2 library, the same as lxml, which means they're very similar in speed and parsing accuracy.
     - Selectors also support regular expressions (RE).
     - RE vs XPath
  12. XPath
     Expression          Meaning
     name                matches all nodes on the current level with the specified name
     name[n]             matches the nth element on the current level with the specified name
     /                   if used as the first character, denotes the top-level document; otherwise denotes moving down a level
     //                  the current level and all sublevels to any depth
     *                   matches all nodes on the current level
     . or ..             the current level / go up one level
     @name               the attribute with the specified name
     [@key=value]        all elements with an attribute that matches the specified key/value pair
     name[@key=value]    all elements with the specified name and an attribute that matches the specified key/value pair
     [text()=value]      all elements with the specified text
     name[text()=valu    all elements with the specified name and text
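A few of these expressions can be tried with nothing but the standard library. The sketch below uses `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` and no `@name` steps), rather than Scrapy's selectors; the sample document and its ids are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (ElementTree needs valid XML).
DOC = """
<html>
  <body>
    <div id="menu">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <div id="content">
      <p>First</p>
      <p>Second</p>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# //   : search the current level and all sublevels
links = [a.get("href") for a in root.findall(".//a")]

# name[n] : the nth element with that name (1-based)
second = [p.text for p in root.findall(".//p[2]")]

# name[@key=value] : element with a matching attribute
content = root.find(".//div[@id='content']")
paragraphs = [p.text for p in content.findall("p")]

# ..  : go up one level, from the matching link to its parent
parents = root.findall(".//a[@href='/about']/..")

# *   : all nodes on the current level (children of body here)
children = [el.tag for el in root.findall("body/*")]
```

Attribute values are read with `.get()` here because this library cannot select `@href` directly; a full XPath engine such as the one behind Scrapy's selectors can.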
  13. CrawlSpider
  14. Pipeline
     - For post-processing
     - All items are passed to process_item by default
     - Add the pipeline to ITEM_PIPELINES
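A pipeline is an ordinary class with a `process_item(self, item, spider)` method. Below is a minimal sketch under stated assumptions: the class name `CleanTitlePipeline`, the `title` field, and the module path `tutorial.pipelines` are hypothetical, and the item is a plain dict so the example runs without Scrapy installed.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline that normalizes the 'title' field of each item."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace and strip both ends; items are dict-like here.
        item["title"] = " ".join(item.get("title", "").split())
        return item  # return the item so that later pipelines receive it


# In the project's settings, the pipeline would be enabled roughly like this.
# The module path is an assumption; the dict-with-order form is the one used by
# recent Scrapy versions, while older releases accepted a plain list.
ITEM_PIPELINES = {"tutorial.pipelines.CleanTitlePipeline": 300}

# Stand-alone check with a plain dict; the spider argument is unused, so None is passed.
cleaned = CleanTitlePipeline().process_item({"title": "  Scrapy   for\n dummies "}, None)
```

Returning the item keeps it moving through any remaining pipelines, so each pipeline can stay small and do one thing.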
  15. Control Method
     - Telnet
     - Web services
     - Scrapyd
     - Using JSON-RPC to control