WEB CRAWLER AND SCRAPER
WEB CRAWLER
What is a web crawler?
• A crawler is a program that visits websites and reads their pages and other
information in order to create entries for a search engine index. The major
search engines on the Web all have such a program, which is also known as a
"spider" or a "bot." Crawlers are typically programmed to visit sites that have
been submitted by their owners as new or updated. Entire sites or specific
pages can be selectively visited and indexed. Crawlers apparently gained the
name because they crawl through a site a page at a time, following the links to
other pages on the site until all pages have been read.
• But now? Not anymore :)
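The page-at-a-time link following described above can be sketched in Python as a breadth-first traversal. This is only a minimal illustration, not how any particular search engine works: the names `LinkParser`, `fetch`, `crawl` and the `max_pages` limit are assumptions made for this example, and the sketch does not check robots.txt or rate-limit its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Hypothetical helper: download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Visit pages on the start URL's site breadth-first, following links.

    max_pages is an assumed safety limit so the sketch always terminates.
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urljoin(url, href).split("#")[0]
            # Stay on the same site and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the download function in as `fetch_page` keeps the traversal separate from the network, so the same loop can be tried against a small dictionary of pages before pointing it at a real site.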
WEB SCRAPING
Web scraping refers to an application that processes the HTML of a web
page to extract data for manipulation, such as converting the page to
another format (e.g. HTML to WML). Web scraping scripts and
applications simulate a person viewing a website with a browser.
With these scripts you can connect to a web server and request a
page, exactly as a browser would. The server sends back
the page, which you can then manipulate or extract specific
information from.
Also known as web data extraction, and often loosely grouped with data or information mining.
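The request-then-extract flow described above can be sketched in Python with only the standard library. This is an illustrative sketch under stated assumptions: `TagTextExtractor`, `request_page`, `extract` and the User-Agent string are names invented for this example, and real pages often need a more robust parser than this.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TagTextExtractor(HTMLParser):
    """Collects the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0  # greater than 0 while inside the target tag
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.items.append("")  # start collecting a new element

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def request_page(url):
    """Hypothetical helper: request a page the way a browser would.

    The User-Agent value is an assumption chosen for illustration.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-scraper)"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html, tag):
    """Return the stripped text of each <tag> element in the HTML."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    return [item.strip() for item in parser.items]
```

A typical use would be `extract(request_page(url), "h2")` to pull the headings from a page; splitting the download from the extraction means the extraction step can be checked against a saved HTML string first.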
WHAT FOR?
• Used for data mining
  • Copying content from a website (without permission)
  • Real estate websites or applications
  • Online store price comparison
  • Websites or apps that suggest information
• Used for SEO
  • Checking daily rank or position in Google, Yahoo, and Bing search results
  • Link builder or dropper (hybrid)
  • Spammer (hybrid)
  • Automated account creator (hybrid)
LANGUAGES AND TOOLS
• Tools
  • UBot Studio: GUI-based, but you can also script your tasks in code.
• Languages
  • JavaScript with PhantomJS, a headless browser scripted in JavaScript
  • JavaScript with CasperJS, a scripting utility built on top of PhantomJS
  • PHP
  • Python
  • Perl
  • Etc.

1ST TECH TALK: Web Crawler and Scraper by Abaam Germones
