Published in: Technology


  1. SCRAPY FOR DUMMIES
     Chandler Huang, previa [at]
  2. Related Resources
     - Web connection: urllib2, httplib2, Requests
     - Screen scraping
       - lxml: XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree. lxml itself is not part of the Python standard library.
       - Beautiful Soup
         - Provides a few simple methods for navigating, searching and modifying a parse tree
         - Automatically converts incoming documents to Unicode and outgoing documents to UTF-8
         - Sits on top of popular Python parsers like lxml and html5lib
         - Deals reasonably well with bad markup
         - One drawback: it's slow
       - Mechanize: programmatic web browsing
  3. Why Scrapy?
     - Portable, open-source, 100% Python: Scrapy is written entirely in Python and runs on Linux, Windows, Mac and BSD. It currently works only with Python 2.6 and 2.7.
     - Simple: Scrapy was designed with simplicity in mind, providing the features you need without getting in your way.
     - Productive: just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you.
     - Extensible: Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core.
  4. Why Scrapy? (continued)
     - Batteries included: Scrapy comes with lots of functionality built in.
     - Well-documented and well-tested: Scrapy is extensively documented and has a comprehensive test suite with very good code coverage.
     - Good community and commercial support
       - Community: !forum/scrapy-users, #scrapy @ freenode
       - Companies like Parsely, Direct Employers Foundation, Scrapinghub
  5. Architecture overview
  6. Project layout
     - scrapy.cfg: the project configuration file
     - tutorial/: the project's Python module; you'll later import your code from here
     - tutorial/: the project's items file
     - tutorial/: the project's pipelines file
     - tutorial/: the project's settings file
     - tutorial/spiders/: a directory where you'll later put your spiders
  7. Basic SOP
     1. Define item: decide what to extract
     2. Define spider: decide the crawling strategy
     3. Define parse function: find patterns to extract data
     4. Pipeline: define post-processing
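
Step 1 of the SOP can be sketched without importing Scrapy at all. The classic form is a `scrapy.Item` subclass with `Field()` attributes; the sketch below instead assumes a Scrapy release recent enough to accept plain dataclasses as items. The class name `ProductItem` and its fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ProductItem:
    """Hypothetical item: the fields we decided to extract from each page."""
    name: str
    url: str
    price: Optional[float] = None  # filled in later, if the page shows a price


# An item is just a typed record that a spider's parse function would produce.
item = ProductItem(name="Widget", url="https://example.com/widget")
record = asdict(item)  # plain dict view, convenient for export or logging
```

Deciding the fields first keeps the later steps honest: the parse function only has to find patterns for these fields, and the pipeline only has to post-process them.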
  8. Live Demo Example
  9. Spider
     - Spiders are classes which define
       - how to perform the crawl
       - how to extract structured data from their pages
     - Built-in spider: BaseSpider
       - The simplest spider, and the one from which every other spider must inherit
       - start_requests() generates a Request for each URL specified in start_urls
       - parse() is the default callback for those Requests
       - parse() processes the response and returns either Item objects, Request objects, or an iterable of both
  10. Spider
      - CrawlSpider
        - The most commonly used spider for crawling regular websites
        - Provides a convenient mechanism for following links by defining a set of rules
        - rules: a list of one (or more) Rule objects; each Rule defines a certain behavior for crawling the site
      - BaseSpider: customize BFS; CrawlSpider: DFS
      - Other built-in spiders: XMLFeedSpider, CSVFeedSpider, SitemapSpider
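
The BFS/DFS distinction on this slide comes down to the order in which discovered links are visited. The following is a framework-free sketch of that idea only, not a description of how Scrapy's scheduler is implemented; the `LINKS` graph and the `crawl_order` helper are hypothetical.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def crawl_order(start, depth_first):
    """Return the order in which pages would be visited."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # DFS takes the newest pending page (stack); BFS takes the oldest (queue).
        page = pending.pop() if depth_first else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:  # never schedule the same page twice
                seen.add(link)
                pending.append(link)
    return order


bfs = crawl_order("/", depth_first=False)
dfs = crawl_order("/", depth_first=True)
```

Breadth-first finishes every page at one depth before going deeper, while depth-first follows one branch to the bottom before returning to its siblings.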
  11. Parse(): Selector
      - Scrapy comes with its own mechanism for extracting data.
      - XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
      - Scrapy Selectors are built over the libxml2 library, the same as lxml, which means they're very similar in speed and parsing accuracy.
      - Selectors also support regular expressions (RE).
      - RE vs XPath
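
One way to read "RE vs XPath" is that the two do different jobs: a path expression picks nodes by document structure, while a regular expression picks a pattern out of text. The sketch below illustrates that split using only the standard library (`xml.etree.ElementTree` and `re`), not Scrapy's own Selector class; the sample markup is made up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree requires valid XML).
SNIPPET = (
    '<ul>'
    '<li class="price">Price: $12.50</li>'
    '<li class="price">Price: $3.99</li>'
    '<li class="note">Free shipping</li>'
    '</ul>'
)
root = ET.fromstring(SNIPPET)

# Path expression: select nodes by structure (li elements with class="price").
texts = [li.text for li in root.findall("li[@class='price']")]

# Regular expression: pull the numeric part out of each selected text.
prices = [float(re.search(r"\$(\d+\.\d+)", text).group(1)) for text in texts]
```

Selecting the node first and applying the regular expression second keeps the pattern small, because it never has to match tags or attributes.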
  12. XPath

      | Expression | Meaning |
      |---|---|
      | name | matches all nodes on the current level with the specified name |
      | name[n] | matches the nth element on the current level with the specified name |
      | / | if used as the first character, denotes the top-level document; otherwise denotes moving down a level |
      | // | the current level and all sublevels to any depth |
      | * | matches all nodes on the current level |
      | . or .. | the current level / go up one level |
      | @name | the attribute with the specified name |
      | [@key=value] | all elements with an attribute that matches the specified key/value pair |
      | name[@key=value] | all elements with the specified name and an attribute that matches the specified key/value pair |
      | [text()=value] | all elements with the specified text |
      | name[text()=valu | all elements with the specified name and text |
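
Several of the expressions above can be tried without Scrapy. This sketch uses `xml.etree.ElementTree` from the standard library, which supports only a subset of XPath: it has no `text()` and cannot select `@name` inside the path, so attributes are read with `.get()` instead. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree requires valid XML).
DOC = """
<html>
  <body>
    <div class="item"><a href="/one">One</a></div>
    <div class="item"><a href="/two">Two</a></div>
    <div class="footer"><a href="/about">About</a></div>
  </body>
</html>
"""
root = ET.fromstring(DOC)  # root is the <html> element

# name and "/": div elements one level below body
divs = root.findall("body/div")

# name[n]: the link inside the second div (positions start at 1)
second_link = root.find("body/div[2]/a")

# "//": every <a> at any depth below the current node
all_links = root.findall(".//a")

# "*": all child elements of body, whatever their name
body_children = root.findall("body/*")

# name[@key=value]: only the divs whose class attribute is "item"
item_divs = root.findall(".//div[@class='item']")

# Attribute values are read from the matched elements rather than via "@href"
item_hrefs = [a.get("href") for a in root.findall(".//div[@class='item']/a")]
```

With Scrapy's selectors the same paths would generally be written against the response, and the full XPath syntax from the table, including `text()` and `@name`, is available there.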
  13. CrawlSpider
  14. Pipeline
      - Used for post-processing
      - By default, every item is passed to process_item
      - Add the pipeline to ITEM_PIPELINES
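
A pipeline is an ordinary Python class with a `process_item(self, item, spider)` method, so a minimal one can be sketched and exercised without running a crawl. The class name `PriceCleanupPipeline`, the `price` field, and the module path `tutorial.pipelines` are assumptions for illustration. The dict form of `ITEM_PIPELINES` shown here is what recent Scrapy releases expect; check the documentation for your version.

```python
class PriceCleanupPipeline:
    """Hypothetical pipeline: normalizes the 'price' field of each item."""

    def process_item(self, item, spider):
        # Called once per scraped item; the returned item is handed
        # to the next pipeline in the chain.
        raw = str(item.get("price", "")).strip()
        item["price"] = float(raw.lstrip("$")) if raw else None
        return item


# In the project's settings file (assumed module path; lower numbers run first).
ITEM_PIPELINES = {
    "tutorial.pipelines.PriceCleanupPipeline": 300,
}

# Exercising the pipeline by hand, with a plain dict standing in for an item.
cleaned = PriceCleanupPipeline().process_item({"name": "Widget", "price": " $12.50 "}, spider=None)
```

Because the pipeline does not depend on the spider that produced the item, the same cleanup applies to every spider in the project.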
  15. Control Method
      - Telnet
      - Web services
      - Scrapyd
      - Using JSON-RPC to control