Web scraping using Python

A brief introduction to web scraping using Python, covering the steps involved in web scraping, with a demo.

Published in: Technology

  1. Web scraping using Python. Vidya Sagar, Member of Technical Staff (VMWare Software India Pvt. Ltd.)
  2. Agenda
     • What is web scraping?
     • Why is scraping needed?
     • Steps involved in web scraping
     • Analyzing web page structure
     • requests, python-selenium and BeautifulSoup
     • Demo
     • Q & A
  3. What is web scraping? Web scraping, also called screen scraping, web data extraction, or web harvesting, is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.
  4. Why is web scraping needed?
     • No APIs available
     • No rate limiting
     • Anonymous access
     • Data mining
     • Link checker
     • Content building
     • Just for fun, etc.
  5. Steps involved in web scraping
     • Analyze the web page
     • Grab the content
     • Parse the content
     • Store the result in a database or file
  6. Analyze the web page
  7. Grab the content (making requests to the website)
     • Console-based applications: urllib/urllib2, requests
     • Browser automation tools: selenium, mechanize
  8. requests
     • A Python library for making HTTP requests
     • Installation and docs:
       • pip install requests
       • To install - http://docs.python-requests.org/en/master/user/install/#install
       • Get started tutorial - http://docs.python-requests.org/en/master/api/#requests.request
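
As a minimal sketch of the "grab the content" step with requests: the block below defines a helper that downloads a page, and separately builds (without sending) a GET request to show what requests would transmit. The URL `https://example.com/`, the helper name `fetch_html`, the `page` query parameter and the 10-second timeout are assumptions for illustration, not taken from the slides.

```python
import requests

# Hypothetical target page; replace it with the page you analyzed.
URL = "https://example.com/"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (performs a network call)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Build (but do not send) a request to inspect what requests would transmit.
prepared = requests.Request("GET", URL, params={"page": 1}).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(URL)` performs the actual download; the prepared request is only a way to check the final URL and method offline.
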
  9. requests
     • Free proxy server list: http://www.freeproxylists.net/in.html
     • User-Agent list: https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
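
The proxy and User-Agent lists on this slide are typically used to configure outgoing requests. The sketch below shows one way this might look with a `requests.Session`; the User-Agent string, the proxy address (taken from a reserved documentation IP range) and the helper name `make_session` are placeholders assumed for illustration.

```python
import requests

# Placeholder values: pick a real User-Agent string and proxy address yourself.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"
PROXIES = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}


def make_session(user_agent=USER_AGENT, proxies=None):
    """Return a Session that sends a custom User-Agent, optionally via proxies."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    if proxies:
        session.proxies.update(proxies)
    return session


session = make_session(proxies=PROXIES)
# session.get("https://example.com/", timeout=10) would now use these settings.
```

Keeping the settings on a session means every later `session.get(...)` call reuses them without repeating the arguments.
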
  10. Selenium
     • Selenium is a web browser automation tool.
     • It was built for test automation, but it is also used for data scraping.
     • Selenium simulates human behavior in a real browser, so a scraper looks less like a machine bot and more like a human visitor.
     • python-selenium:
       • To install - http://selenium-python.readthedocs.io/installation.html
       • Get started tutorial - http://selenium-python.readthedocs.io/getting-started.html
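
A rough sketch of grabbing a page through a real browser with python-selenium is shown below. It assumes Selenium is installed and that Chrome with a matching driver is available on the machine; the helper name `fetch_rendered_html` is hypothetical.

```python
def fetch_rendered_html(url):
    """Open url in Chrome via Selenium and return the rendered page source."""
    # Imported here so this file can be loaded on machines without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver exist
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

For example, `fetch_rendered_html("https://example.com/")` would return the HTML after the browser has loaded the page, which can then be handed to a parser.
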
  11. Parse the content
     • BeautifulSoup
     • HTMLParser (Python 2 standard library)
     • html.parser (Python 3 standard library)
     • lxml, etc.
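
To illustrate the parsing step with only the standard library, here is a small sketch using `html.parser` from Python 3. The class name `LinkCollector` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


SAMPLE_HTML = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```
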
  12. BeautifulSoup
     • Beautiful Soup is a Python library that parses HTML and XML content and pulls data out of it. It provides methods to navigate and search the HTML DOM.
     • To install - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
     • Get started tutorial - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
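
The navigate-and-search methods mentioned on this slide might be used as in the sketch below. It assumes the `bs4` package is installed; the sample HTML, its class names and the field names `name` and `url` are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample page standing in for downloaded HTML.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a></li>
    <li class="item"><a href="/p/2">Gadget</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag; select() takes a CSS selector.
title = soup.find("h1", id="title").get_text(strip=True)
items = [
    {"name": a.get_text(strip=True), "url": a["href"]}
    for a in soup.select("li.item a")
]
print(title, items)
```
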
  13. Store result in database or file. Once the data is collected, save it in the desired location: a database, a CSV file, or another file.
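
One possible way to store scraped rows as CSV is sketched below using the standard `csv` module. The helper name `write_rows`, the column names and the sample rows are assumptions; an in-memory buffer stands in for a real file so the example has no side effects.

```python
import csv
import io


def write_rows(rows, fileobj, fieldnames=("name", "url")):
    """Write a list of dicts to fileobj as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)


# Invented sample data standing in for parsed results.
rows = [
    {"name": "Widget", "url": "/p/1"},
    {"name": "Gadget", "url": "/p/2"},
]

buffer = io.StringIO()  # stands in for open("results.csv", "w", newline="")
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `newline=""` so the `csv` module controls the line endings.
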
  14. Demo
  16. You can reach me at:
     • Vidya Sagar, Member of Technical Staff, VMWare Software India Pvt. Ltd.
     • Email Id - Sagarv@vmware.com
     • Gmail - vidyasagar0911@gmail.com
     • Github - https://github.com/vidhack/
     • Linkedin - https://www.linkedin.com/in/vidya0911/
     • Mob. No. - 8826567548
     Thank You
