A Complete Step-by-Step Guide to Data Science Skills for Scraping Websites
Web scraping is an automated method for obtaining large amounts of data from websites. Most of this data is
unstructured HTML, which is then converted into structured data in a spreadsheet or a database
so that it can be used in various applications.
There are many different ways to perform web scraping: using online
services, dedicated APIs, or even writing your own scraping code from scratch. Many large websites, like
Google, Twitter, Facebook, Stack Overflow, etc., have APIs that allow you to access their data in a structured
format.
Using an API is the best option, but other sites don't allow users to access large amounts of data in a
structured form, or are simply not that technologically advanced. In that situation, it's best to use web
scraping to get the data from the website. To learn more, check out webscraping.
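As a concrete taste of turning unstructured HTML into structured data, here is a minimal standard-library sketch. The HTML snippet and link paths below are made up for illustration; the project itself uses Selenium, not this parser.

```python
from html.parser import HTMLParser

# A tiny parser that collects the 'href' of every <a> tag,
# turning a blob of unstructured HTML into a structured Python list.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Hypothetical page fragment, standing in for a real restaurant listing.
html = '<div><a href="/pizza">Pizza Place</a><a href="/sushi">Sushi Bar</a></div>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/pizza', '/sushi']
```

Real pages are messier than this fragment, which is why the project relies on Selenium and class names instead of hand-rolled parsing.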
Scraping Zomato's Top 100 Restaurants Using Selenium
Launched in 2010, this technology platform connects customers, restaurant partners and delivery partners,
serving their multiple needs.
Customers use the platform to search and discover restaurants, read and write customer-generated reviews,
view and upload photos, order food delivery, book a table and make payments while dining out at restaurants.
On the other hand, Zomato provides restaurant partners with industry-specific marketing tools which enable them to
engage and acquire customers to grow their business, while also providing a reliable and efficient last-mile delivery
service. If you have not come across "ZOMATO" yet, welcome to planet Earth, and do check out zomato.
Let's walk you through 'WEB SCRAPING'!
Objective:
Scraping the best 100 listings on Zomato by parsing the information from the website into tabular data.
List of details we are looking for on the website:
1- The top 100 restaurant listings for each location.
2- The 'Name' of the restaurants for each location.
3- The 'Ratings' for dining at the restaurants for each location.
4- The 'Link' of the restaurants for each location.
Outline of the project:
1- Understanding the structure of Zomato's website.
2- Installing and importing the required libraries.
3- Simulating the page and extracting the names, ratings and URLs of different restaurants from the website using Selenium.
4- Accessing each restaurant and building a method to locate the exact name, rating and URL for the top 100 places.
5- Parsing the top 100 restaurants for each location, consisting of the name, dining rating and link of each place, using helper functions.
6- Storing the extracted data in a dictionary.
7- Compiling all the data into a DataFrame using pandas and saving the data to a CSV file.
Use the "Run" button to execute the code.
By the end of the project we will create a DataFrame in the following format:
Project Code on Replit
The code used for this project is publicly available on the Replit platform. Feel free to explore the
code and make changes to improve it and make it more efficient. Let's get on the road to identify
how the details are fetched and scraped for this project.
Replit Platform
The List of Packages Used
FIRST-- SELENIUM -- what is selenium
SECOND -- PANDAS -- what is pandas
THIRD -- TIME -- why do we use TIME
FOURTH -- OS -- why do we use OS
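The extracted code never shows where TIME is actually called; a hedged guess is that it pauses between page loads so Zomato's dynamically rendered listings can appear before elements are queried. A minimal sketch of that pattern (the duration here is shortened; a real scrape might sleep a few seconds, or better, use Selenium's WebDriverWait):

```python
import time

# Measure a short pause, the same primitive a scraper would use
# after driver.get() before calling find_element.
start = time.perf_counter()
time.sleep(0.1)  # hypothetical wait; a real scrape would likely use 2-5 seconds
elapsed = time.perf_counter() - start
print(elapsed >= 0.1)  # True: sleep() waits at least the requested duration
```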
Let's Discuss the Steps in the Project
1ST STEP
At the beginning of the project, we import the required packages, as shown below:
import os
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
2ND STEP
Let's create a function to build the web driver that we will use to extract webpage information. The driver function is as follows:

def get_driver():
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # To access Zomato's website we need to set up a standard 'user-agent';
    # we can't access the site without one. Learn more about user-agent setup.
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)'
    chrome_options.add_argument('user-agent={0}'.format(user_agent))
    driver = webdriver.Chrome(options=chrome_options)
    return driver

# Calling the driver to carry out further steps
driver = get_driver()

3RD STEP
Creating a helper function to get the list of restaurant names from the website. We call it 'res_name(driver)':

def res_name(driver):
    place_divs_tag = 'sc-bke1zw-0'
    places = driver.find_element(By.CLASS_NAME, place_divs_tag)
    tags = places.find_elements(By.CLASS_NAME, 'sc-bke1zw-1')
    res_names = []
    for i in tags:
        res_names.append(i.find_element(By.XPATH, ".//div/section/div[1]/a").text)
    return res_names[:100]
# Here we fetch the common class name holding the details of all the required restaurants,
# then fetch the 'NAME' of each place in a common way using XPath.
# To learn about XPath, click here.
# We need to understand the HTML code structure before we scrape any website. To learn about HTML, click here.
A Little Brief on HTML and XPath
Before we go deeper into the explanation of the code, it is imperative that readers have a basic understanding of
HTML, the language of the web, and XPaths, which are used to navigate through elements and attributes in an
HTML/XML document. HTML (HyperText Markup Language) is the code that is used to structure a web page and
its content. For example, content could be structured within a set of paragraphs, a list of bulleted points, or using
images and data tables. We will be using XPaths to point to tags, attributes and elements of an HTML webpage to
extract required information such as, in our case, restaurant names, ratings and URLs.
To avoid putting too much information into one notebook, and to save time for readers who are already familiar
with HTML and XPaths, we will not cover them in depth here.
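To make the XPath idea concrete, here is a small standard-library sketch that mirrors the relative path ".//div/section/div[1]/a" used in res_name(). The markup below is a hypothetical miniature of a listing card, not Zomato's real HTML, and it must be well-formed since ElementTree parses XML:

```python
import xml.etree.ElementTree as ET

# Two hypothetical restaurant "cards"; each name anchor sits at the
# same relative position: div -> section -> first div -> a.
cards = ET.fromstring(
    '<div>'
    '<div><section><div><a>Restaurant A</a></div><div>extra</div></section></div>'
    '<div><section><div><a>Restaurant B</a></div><div>extra</div></section></div>'
    '</div>'
)

# ElementTree supports a subset of XPath, including position predicates
# like div[1] (1 is the first child).
names = [a.text for a in cards.findall('.//div/section/div[1]/a')]
print(names)  # ['Restaurant A', 'Restaurant B']
```

Selenium's By.XPATH accepts the same kind of relative expression, evaluated against each card element rather than a parsed XML tree.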
4TH STEP
Creating a helper function to get the list of URLs from the website. We call it 'res_url(driver)':

def res_url(driver):
    place_divs_tag = 'sc-bke1zw-0'
    places = driver.find_element(By.CLASS_NAME, place_divs_tag)
    tags = places.find_elements(By.CLASS_NAME, 'sc-bke1zw-1')
    urls = []
    for i in tags:
        urls.append(i.find_element(By.TAG_NAME, "a").get_attribute('href'))
    return urls[:100]
5TH STEP
Creating a helper function to get the list of ratings from the website. We call it 'res_ratings(driver)':

def res_ratings(driver):
    place_divs_tag = 'sc-bke1zw-0'
    places = driver.find_element(By.CLASS_NAME, place_divs_tag)
    tags = places.find_elements(By.CLASS_NAME, 'sc-bke1zw-1')
    ratings = []
    for i in tags:
        try:
            ratings.append(i.find_element(By.CLASS_NAME, 'sc-1q7bklc-5').text)
        except NoSuchElementException:
            ratings.append('.')
    return ratings[:100]
To avoid crashing when a listing has no rating, we make use of a 'try and except' block, catching the
NoSuchElementException we imported earlier. Here you can learn more about try and except.
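The same fallback pattern can be illustrated in plain Python, with dictionaries standing in for page elements; the data below is hypothetical:

```python
# Hypothetical stand-ins for restaurant cards; one is missing its rating,
# just as a Zomato listing may lack the rating element on the page.
cards = [
    {'name': 'Restaurant A', 'rating': '4.2'},
    {'name': 'Restaurant B'},                  # no rating available
    {'name': 'Restaurant C', 'rating': '3.9'},
]

ratings = []
for card in cards:
    try:
        ratings.append(card['rating'])
    except KeyError:           # mirrors catching NoSuchElementException
        ratings.append('.')    # placeholder keeps the lists aligned

print(ratings)  # ['4.2', '.', '3.9']
```

Appending a placeholder instead of skipping keeps the NAME, RATINGS and LINK lists the same length, which matters when they become DataFrame columns.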
6TH STEP
We create a parser function named "get_all_cities()" to extract the required details (NAME, RATINGS, LINK)
from the website in the form of a dictionary. The function is written to work
irrespective of the location, e.g. Mumbai, Pune, Bangalore, Delhi, Chandigarh, etc. By creating this function we
get the required details, which was the objective of this project, and it can be done for any location; in this case we
are scraping Mumbai, Bangalore and Pune.

def get_all_cities():
    cities = ['mumbai', 'bangalore', 'pune']
    dic = {'NAME': [], 'RATINGS': [], 'LINK': []}
    for i in cities:
        base_url = 'https://www.zomato.com/' + i + '/great-food-no-bull'
        driver.get(base_url)
        dic['NAME'].extend(res_name(driver))
        dic['RATINGS'].extend(res_ratings(driver))
        dic['LINK'].extend(res_url(driver))
    return dic
7TH STEP
We create a pandas DataFrame of the parsed data and export it to a CSV file named
best100.csv, achieving the expected result shown again below.
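The code for this final step is not reproduced in the extracted text; a minimal sketch of it, using hypothetical placeholder rows in place of the dictionary returned by get_all_cities():

```python
import pandas as pd

# Hypothetical data in the same shape as get_all_cities() returns;
# real values come from the scraped pages.
dic = {
    'NAME': ['Restaurant A', 'Restaurant B'],
    'RATINGS': ['4.2', '.'],
    'LINK': ['https://www.zomato.com/a', 'https://www.zomato.com/b'],
}

df = pd.DataFrame(dic)
# With no path argument, to_csv returns the CSV text; in the project
# it would be df.to_csv('best100.csv', index=False) to write the file.
csv_text = df.to_csv(index=False)
print(df.shape)  # (2, 3)
```

Passing index=False keeps pandas from writing its row index as an extra unnamed column in the CSV.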
SUMMARY
It is quite fascinating how much ease web scraping brings to the life of a coder. Summing up, we
essentially built the code in the following steps:
-we set up the required packages: selenium, pandas, time and os.
-we created helper functions to get the names, ratings and URLs for the top 100 listings.
-we created a parser function to get the name, rating and URL for the top 100 listings for three locations, in
dictionary form.
-we created a proper DataFrame and saved the work in CSV format.
FUTURE WORK
-The code can be adapted to different locations, fetching the same details simply by changing the list of cities.
-More details, such as each restaurant's contact number, can be fetched using the proper path for each place.
-We can identify the most consistent restaurants, those likely to stay in the top 50 listings on average, and
provide useful insight for investors who might like to invest in successful restaurants.
-This can be done for any location, which is the beauty of this project, and it can always be improved over time.
-The project can be set up on a service like AWS Lambda for automatic, scheduled scraping.
REFERENCES
Jovian: web scraping with Python
Web scraping
Complete guide on Selenium
HTML tutorial
XPath
pandas
Exceptions (try and except)
The notebook for this project is saved on Jovian at https://jovian.ai/hai-advisoryservices/web-scraping-project.