How To Crawl Amazon Website Using Python Scrapy?
Crawl products from Amazon.com with Python Scrapy, downloading the meta information
and images for every item in a pre-defined list of categories.
Project Overview
Artificial Intelligence relies heavily on data, and quality matters as much as quantity.
A common obstacle in Machine Learning is a shortage of data: models need sufficient
data for training, analysis, and evaluation, and without it a dependable project is
not possible. Amazon's site offers an extensive range of high-quality data, both text
and images, across electronics, fashion, and many other product types, all of which
can be very useful for smaller-scale Machine Learning or Deep Learning projects. In
this project, we will crawl the amazon.in website to download images for categories
the user specifies. The crawler automatically creates folders and downloads the
pictures and a meta.txt file for every item. The same approach can be adapted to
scrape data from other web pages.
Create a Spider
Let’s start the proceedings…
Import Libraries
We will be using Python with the Scrapy library, a few supporting libraries, and some
basic knowledge of how web pages are structured. We will create a Spider from the
Scrapy library to crawl the website.
Now let us define the constructor for the Spider class. It asks the user to input the
categories to search, separated by white spaces. The spider will search each of these
categories on the website. For every category we generate a search URL and save it in
a list so a loop can access it easily. We use the start_urls list to store the starting
URL and other_urls for the remaining URLs derived from the user's input. A counter
variable records the total number of URLs generated, which is used in later steps.
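The constructor logic described above can be sketched as a small helper. This is a hypothetical, dependency-free version; the function name build_search_urls is not from the original code, only the URL format, start_urls, other_urls, and the counter are.

```python
def build_search_urls(raw_input: str):
    """Split space-separated categories into Amazon search URLs."""
    categories = raw_input.split()
    urls = ["https://www.amazon.in/s?k=" + c for c in categories]
    # The first URL seeds start_urls; the rest go to other_urls.
    start_urls = urls[:1]
    other_urls = urls[1:]
    counter = len(urls)  # total URLs generated, used in later steps
    return start_urls, other_urls, counter
```

In the real spider these lists would be attributes set inside `__init__` after calling `input()`.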
Next, we define a method named parse. It requests the webpage at the URL in the
start_urls list and receives the result in a response variable. We need to specify
the full path of the directory where the downloaded text and images will be stored;
the path components should be joined with ‘/’ rather than the native Windows ‘\’
separator. Then we derive the category name from response.url by finding the
character ‘=’ and taking the substring after it. For a user input of mobile, the
requested URL has the form https://www.amazon.in/s?k=mobile, and the same URL
appears in the response, so we can recover the category name from there.
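The substring step above amounts to a one-liner; a minimal sketch (the helper name category_from_url is hypothetical):

```python
def category_from_url(url: str) -> str:
    """Recover the searched category from a URL like
    https://www.amazon.in/s?k=mobile by taking the text after '='."""
    return url[url.find("=") + 1:]
```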
Using that name, we create a directory inside the parent directory. xpath is a
selector in Scrapy used to extract specific information from the webpage.
An xpath expression returns data from the HTML page based on the tags we pass in.
For instance, xpath(‘//h2/a/@href’) returns the contents of the ‘href’ attribute of
every ‘a’ tag inside an ‘h2’ tag, and the extract() method materializes that
information. Here it yields the links to the items that appear when a category such
as Mobile is searched. We limit the result to 20 items, since we only want the first
20 appearances. Similarly, we extract the item names under each category in order
to create subfolders, again limited to 20.
For the example category Mobile, the item names include Samsung Galaxy M33 5G
(Color: Emerald Brown with 6GB & 128GB Storage), etc., and all_links holds the URLs
of those items.
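To illustrate what xpath('//h2/a/@href').extract()[:20] produces, here is a dependency-free sketch that pulls the same href values with a plain regular expression instead of Scrapy selectors (the function and the sample HTML are illustrative, not from the original project; in the real spider, xpath does this job):

```python
import re

def extract_links(html: str, limit: int = 20):
    """Pull href values from <h2>…<a href="…"> blocks, i.e. the data
    xpath('//h2/a/@href').extract() would return, keeping the first 20."""
    links = re.findall(r'<h2[^>]*>\s*<a[^>]*href="([^"]+)"', html)
    return links[:limit]
```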
We then iterate over the all_links list to visit every item. The item names are long
and contain unwanted characters, so we first perform some cleanup on them. We keep
yielding Requests until the links are exhausted. Each Request routes its response to
the parse_pages method and passes the save path as a dictionary parameter; the other
parameters also matter in the context of this project and help the requests go
through to the domain smoothly.
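The name cleanup might look like the following sketch; the exact character set and length cap are assumptions (characters illegal in Windows folder names), not taken from the original code:

```python
import re

def clean_item_name(name: str, max_len: int = 50) -> str:
    """Shorten long product titles and strip characters that are
    illegal in Windows folder names."""
    name = re.sub(r'[<>:"/\\|?*]', "", name)  # drop unwanted chars
    return name.strip()[:max_len]
```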
The parse method crawls the webpage for a category, while parse_pages crawls the
page of each individual item under that category. In parse_pages we first retrieve
the text for the item using xpath selection and then clean it with regular
expressions (regex); these modifications remove particular characters and put the
text into a readable format. We then write the text to a file called meta.txt.
Samples of the resulting text files, and of how they are saved, appear at the end
of the project.
This code also scrapes the image URLs, which we use to download the images and save
them in a local folder. Because we request pages and images many times, the domain
may block us, so we sleep for about 10 seconds and retry whenever the server
initially refuses the connection. We also use a Request object to pull responses
for the remaining URLs in the other_urls list.
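The sleep-and-retry pattern described above can be sketched generically; fetch_with_retry is a hypothetical wrapper around whatever download call is used, and the retry count is an assumption:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=10):
    """Call fetch(url); if the domain refuses the connection,
    sleep ~10 seconds and try again before giving up."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```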
The create_dir method automatically creates the subfolders for the items under
each category.
Wonderful! Now that the spider is created and saved, let's run it and see the
results.
To run a spider, open an Anaconda prompt or command prompt in the directory that
contains the scrapy.cfg file. That is the Scrapy configuration file, and the
spider's Python file lives in the spiders folder inside the same directory. Create
a Scrapy project and place the spider code in the spiders.py file; creating a new
project with Scrapy is straightforward and is covered in the Scrapy documentation.
To run the spider, execute the command scrapy crawl spider_name, here
scrapy crawl amazon_spider, from the directory containing scrapy.cfg, as shown
below:
You will now see output such as the bot name scroll past, and it quickly pauses once
it prints “Enter items you need to search separated by spaces: ”. This is the moment
to enter the categories you want to search; in this example, “mobile tv t-shirts”,
separated by spaces. Press Enter, and the prompt will print a stream of information,
mostly the requests being sent to the amazon.in domain. You can watch the changes
appear in the directory you specified for saving files; here it was the
Crawled_Items folder.
Wonderful! The output will be ready soon. Let's take a look at the downloaded
content:
A folder is created for every category.
Under every category, there is a folder for each item, at most 20.
Under every item, there are images and a meta.txt file containing the item's
descriptive details.
The meta.txt file with the item information:
This marks the end of the project.
Happy Coding!
For more information, contact iWeb Data Scraping. Contact us for web scraping
and mobile app scraping service requirements!