
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy


An introduction to web scraping. It covers the most widely used libraries and how they can be used to scrape the required data. Created by Littin Rajan

Published in: Software


  1. {Web Scraping} https://www.linkedin.com/in/littinrajan An Introduction to Web Scraping using Python http://littinrajan.wordpress.com/
  2. AGENDA • What is Web Scraping? • Why is it needed? • How does it work? • How to do Massive Web Scraping? • Can we make it Automated?
  3. WEB SCRAPING ‘Web Scraping’ is a technique for gathering structured data or information from web pages. It offers a quick way to acquire data that is presented on the web in a particular format. What is it?
  4. WEB SCRAPING In some cases APIs are not capable of returning all the data we want from web pages. With scraping we can access a website anonymously and gather its data, and we are not limited to what an API exposes. Why is it needed?
  5. WEB SCRAPING 1. Access the target website using an HTTP library like Requests, urllib, httplib, etc. 2. Parse the content of the page using a parsing library like Beautiful Soup, lxml, regular expressions, etc. 3. Save the result in the required format: database table, CSV, Excel, text file, etc. How does it work?
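The three steps above can be sketched as follows. The sample HTML is inlined so the sketch runs offline; in practice step 1 would be something like `html = requests.get(url).text`, and the `<h2>` selector and file name are made-up examples.

```python
import csv
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Step 1 (access): a fetched page, inlined here so the sketch runs offline
html = """<html><body>
<h2>First headline</h2>
<h2>Second headline</h2>
</body></html>"""

# Step 2 (parse): pull out every <h2> heading
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Step 3 (save): write the result to a CSV file
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
```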
  6. WEB SCRAPING Requests Requests is a Python HTTP library that allows us to send HTTP requests from Python. Part 1: Accessing Data
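A minimal sketch of fetching a page with Requests. A throwaway local HTTP server stands in for a real website so the example runs without network access; the served HTML is made up.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # third-party: pip install requests

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Stand-in for a real site: serve one page on a random free port
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
response = requests.get(url, timeout=5)  # send the HTTP GET request
print(response.status_code)              # HTTP status of the reply
print(response.text)                     # raw HTML of the page
server.shutdown()
```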
  7. WEB SCRAPING urllib3 urllib3 is a powerful, user-friendly HTTP client for Python. Part 1: Accessing Data
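A sketch of setting up urllib3's connection pool with retries and timeouts; the retry and timeout numbers are illustrative choices, not values from the slides.

```python
import urllib3

# Retry failed requests up to 3 times, backing off between attempts
retry = urllib3.util.Retry(total=3, backoff_factor=0.5)
# Separate connect and read timeouts, in seconds
timeout = urllib3.util.Timeout(connect=2.0, read=5.0)

http = urllib3.PoolManager(retries=retry, timeout=timeout)

# A request would then be sent as, e.g.:
# resp = http.request("GET", "https://example.com")
# print(resp.status, resp.data[:100])
```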
  8. WEB SCRAPING httplib2 httplib2 is a small, fast HTTP client library for Python. It features persistent connections, caching, and Google App Engine support. Part 1: Accessing Data
  9. WEB SCRAPING BeautifulSoup4 Beautiful Soup is a parsing library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. It is very easy to use, but comparatively slow at parsing. Part 2: Parsing Content
  10. WEB SCRAPING BeautifulSoup4 It can handle broken markup, and scraping code can be written purely in Python. Part 2: Parsing Content
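A small demonstration of the broken-markup handling: none of the tags in this made-up snippet are ever closed, yet the parse tree comes out usable.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

broken = "<html><body><p>Some <b>bold text"  # tags never closed
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the open tags for us at end of input
print(soup.p.get_text())
print(soup.b.get_text())
```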
  11. WEB SCRAPING lxml lxml is the most feature-rich and easy-to-use library for processing XML and HTML in Python; it represents documents as an element tree. It is very fast at processing, but the code is not purely in Python (it binds to the C libraries libxml2 and libxslt). Part 2: Parsing Content
  12. WEB SCRAPING lxml lxml is able to work with all Python versions from 2.x to 3.x. Part 2: Parsing Content
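A sketch of lxml parsing an HTML snippet into an element tree and querying it with XPath; the snippet and link targets are made-up examples.

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring(
    '<div><a href="/about">About</a><a href="/contact">Contact</a></div>'
)

# XPath queries against the element tree
hrefs = doc.xpath("//a/@href")            # attribute values
texts = [a.text for a in doc.xpath("//a")]  # element text
print(hrefs, texts)
```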
  13. WEB SCRAPING RegEx The re library lets us work with regular expressions: given a pattern, it can parse out the matching data. It is best used to extract only small amounts of text, and to handle it we have to learn its metacharacters, e.g. '.', '*', '$', '^', '\b', '\w', '\d'. Part 2: Parsing Content
  14. WEB SCRAPING RegEx The code is purely in Python. It is very fast and supports all versions of Python. Part 2: Parsing Content
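A sketch of extracting a small, predictable pattern with the standard-library `re` module; the HTML string is a made-up example. As the slides note, this is fine for minute amounts of text but brittle on complex or irregular markup.

```python
import re

page = ('<a href="https://example.com">Example</a>'
        '<a href="https://python.org">Python</a>')

# Capture everything between href=" and the closing quote
urls = re.findall(r'href="([^"]+)"', page)
print(urls)
```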
  15. WEB SCRAPING After parsing we have the collection of data we want to work with, and we can convert it into a convenient format for later use. We can save the data in various formats: a database table, a Comma-Separated Values (CSV) file, an Excel file, or a plain text file. Part 3: Saving the Result
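As a sketch of the database-table option, here scraped rows are saved with the standard-library `sqlite3` module; the file, table, and column names are made up for the example.

```python
import sqlite3

# Rows as they might come out of the parsing step
rows = [("Laptop", 999.0), ("Mouse", 25.0)]

con = sqlite3.connect("scraped.db")
con.execute("DROP TABLE IF EXISTS products")  # keep the sketch rerunnable
con.execute("CREATE TABLE products (name TEXT, price REAL)")
con.executemany("INSERT INTO products VALUES (?, ?)", rows)
con.commit()

count = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
con.close()
```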
  16. WEB SCRAPING The Requests library is slower than the others, but it has the advantage of supporting RESTful APIs. httplib2 has the lowest execution time, but it is harder to work with. Time Comparison: HTTP Libraries
  17. WEB SCRAPING Beautiful Soup takes more time to parse the data, but it is widely used because of its broad support. RegEx is very easy to use and runs faster, but it cannot handle complex situations. Time Comparison: Parsing Libraries
  18. WEB SCRAPING Sometimes millions of web pages need to be scraped every day to get a solution. Often the source web pages change, and it becomes havoc to keep extracting the required data. In some cases regex won't work but Beautiful Soup will; the issue is that the output is then generated very slowly. How to do Massive Web Scraping?
  19. WEB SCRAPING SCRAPY is the solution for massive web scraping. It is a free and open-source web-crawling framework written in Python. It can also be used to extract data using APIs or as a general-purpose web crawler. It comprises almost all the tools we need for web scraping. How to do Massive Web Scraping?
  20. 20. WEB SCRAPING  When there is millions of pages to scrape.  When you want asynchronous processing(multiple request at a time)  When the data is funky in nature and it is not properly formatted.  Pages with server issues.  Websites with login wall. Scrapy: When to Use?
  21. WEB SCRAPING 1. Define a scraper. 2. Define the items to extract. 3. Create a spider to crawl. 4. Run the scraper. Scrapy: Workflow
  22. WEB SCRAPING First we define the scraper by creating a project. This generates a directory with the required files and subdirectories. Scrapy: Defining the Scraper
  23. WEB SCRAPING The root directory contains a configuration file, 'scrapy.cfg', and the project's Python module. The module folder contains the items file, pipelines file, settings file, middlewares file, a directory for spiders, and an __init__.py file. Scrapy: Defining the Scraper
  24. WEB SCRAPING Items are the containers used to collect the data that is scraped from the websites. We define our items by editing 'items.py'. Scrapy: Defining Items to Extract
  25. WEB SCRAPING Spiders are classes which define:  how a certain site will be scraped,  how to perform the crawl, and  how to extract structured data from its pages. A spider can be created from a sample template (e.g. with the scrapy genspider command). Scrapy: Creating a Spider to Crawl
  26. WEB SCRAPING In order to crawl our data we define the callback function parse(), which collects the data of our interest. We can also define settings in the spider, such as allowed domains, callback responses, etc. Scrapy: Creating a Spider to Crawl
  27. WEB SCRAPING After defining the items and our crawler, we run the scraper with the scrapy crawl command. We can also store the scraped data using Feed Exports. Scrapy also provides interactive shell scripting via the built-in Scrapy shell, triggered with the scrapy shell command. Scrapy: Run the Scraper
  28. WEB SCRAPING Automated code completes the process without any human intervention and can more easily pass through the walls of webpages without getting blocked. The solution is Selenium, one of the best-known packages for automating web-browser interaction. It also supports Python. Can we make it Automated?
  29. THANK YOU
