CHALLENGES IN WEB CRAWLING
WEB CRAWLER
A web crawler (also known as an ant, automatic indexer, bot, web
spider, or web robot) is an automated program, or script, that
methodically scans or “crawls” through web pages to create an
index of the data it is set to look for. This process is called
web crawling or spidering.
CRAWLER
A crawler is a program that visits Web sites and reads their
pages and other information in order to create entries for
a search engine index. The major search engines on the
Web all have such a program, which is also known as a
"spider" or a "bot." Crawlers typically find pages by following
links from pages they already know, and also visit sites that
their owners have submitted as new or updated.
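As a rough illustration of "visiting sites and reading their pages", here is a minimal sketch of a breadth-first crawl. It assumes a tiny in-memory web (the PAGES dictionary and the example.test URLs are made up for this example) instead of real HTTP requests, and it is not how any particular search engine's crawler is built.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, pages):
    """Breadth-first crawl starting at seed; returns URLs in visit order."""
    seen = {seed}          # URLs already queued, so no page is visited twice
    queue = deque([seed])  # the crawl frontier
    order = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:   # unknown URL: nothing to read
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling crawl("http://example.test/", PAGES) visits the home page first, then the two pages it links to. The seen set is what keeps the crawl from looping when pages link back to each other.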
HOW A WEB CRAWLER WORKS
The World Wide Web is full of information. If you want to know
something, you can probably find the information online. But
how can you find the answer you want, when the web contains
trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the looking for us.
But how do search engines know where to look? How can
search engines recommend a few pages out of the trillions that
exist? The answer lies with web crawlers.
HOW A WEB CRAWLER WORKS
Crawlers scan web pages to see what words they contain,
and where those words are used. The crawler turns its
findings into a giant index. The index is basically a big list of
words and the web pages that feature them. So when you
ask a search engine for pages about hippos, the search
engine checks its index and gives you a list of pages that
mention hippos. Web crawlers scan the web regularly so
they always have an up-to-date index of the web.
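The "big list of words and the web pages that feature them" can be sketched as an inverted index. This is a minimal sketch: the DOCS dictionary and page names are invented for the example, and the tokenizer is a deliberately simple assumption (lowercase runs of letters and digits).

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: page name -> text found on it.
DOCS = {
    "zoo.html": "Hippos and zebras live at the zoo.",
    "river.html": "Hippos spend the day in the river.",
    "sky.html": "Parachutes open in the sky.",
}


def build_index(docs):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for page, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page)
    return index


def search(index, word):
    """Return the pages that mention word, in alphabetical order."""
    return sorted(index.get(word.lower(), set()))
```

With this index, search(build_index(DOCS), "hippos") returns the two pages that mention hippos, without rescanning any page text at query time.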
THE SEO IMPLICATIONS OF WEB CRAWLERS
Now that you know how a web crawler works, you can see
that the behavior of the web crawler has implications for
how you optimize your website.
For example, you can see that, if you sell parachutes, it’s
important that you write about parachutes on your website.
If you don’t write about parachutes, search engines will
never suggest your website to people searching for
parachutes.
THE SEO IMPLICATIONS OF WEB CRAWLERS
It’s also important to note that web crawlers don’t just pay attention to
which words they find; they also record where the words are found. A word
that appears in headings, meta data or the first few sentences is likely to
be more important in the context of the page, and keywords in prime
locations suggest that the page is really ‘about’ those keywords.
So if you want search engines to know that parachutes are a big deal on
your website, mention them in your headings, meta data and opening
sentences.
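One way to picture "prime locations" is a score that counts a keyword more heavily in some parts of a page than in others. The sketch below is purely illustrative: the field names and the FIELD_WEIGHTS values are assumptions made up for this example, not the weights of any real search engine.

```python
import re

# Hypothetical weights: a match in a heading counts more than one in the body.
FIELD_WEIGHTS = {"heading": 3.0, "meta": 2.0, "opening": 1.5, "body": 1.0}


def keyword_score(page, keyword):
    """Sum weighted keyword occurrences across the fields of a page."""
    keyword = keyword.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"[a-z0-9]+", page.get(field, "").lower())
        score += weight * words.count(keyword)
    return score


page_a = {
    "heading": "Parachutes for sale",
    "meta": "Buy parachutes online",
    "opening": "We sell parachutes.",
    "body": "Contact us.",
}
page_b = {
    "heading": "Our shop",
    "meta": "",
    "opening": "Welcome.",
    "body": "We also stock parachutes and parachutes accessories.",
}
```

Under these assumed weights, page_a scores higher for "parachutes" than page_b, even though page_b repeats the word in its body text.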
The fact that web crawlers regularly trawl the web to keep their index
up to date also suggests that having fresh content on your website is a
good thing.
SEARCH ENGINE INDEXES
Once the crawler has found information by crawling over the web, the
program builds the index. The index is essentially a big list of all the
words the crawler has found, as well as their location.
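To show what recording a word's location could look like, here is a minimal sketch of a positional index that stores, for each word, the pages it appears on and its word offsets within each page. The function name and sample documents are assumptions for illustration only.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {page: [word offsets where it occurs]}."""
    index = defaultdict(dict)
    for page, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            index[word].setdefault(page, []).append(pos)
    return index


# Hypothetical documents.
sample_docs = {
    "a.html": "big list of words",
    "b.html": "words and more words",
}
positions = build_positional_index(sample_docs)
```

Here positions["words"] records offset 3 on a.html and offsets 0 and 3 on b.html. Keeping offsets is what would let an engine tell that a word appears near the start of a page.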
CHALLENGES IN WEB CRAWLING
• Challenge I: Non-Uniform Structures
• Challenge II: Omnipresence of AJAX elements
• Challenge III: The “Real” Real-Time Latency
• Challenge IV: Who owns UGC?
CHALLENGE I: NON-UNIFORM STRUCTURES
Data formats and structures are inconsistent across the ever-evolving Web,
and there are no binding norms for how to build an Internet presence.
The result?
A lack of uniformity across the vast, ever-changing terrain of the Internet.
The problem?
Collecting data in a machine-readable format becomes difficult, and the
difficulty grows with scale, especially when:
a) structured data is needed, and
b) a large number of fields must be extracted against a specific schema from
multiple sources.
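The schema problem above can be sketched as a per-source field mapping onto one target record. Everything here is hypothetical: the source names, the raw field names, and the three-field target schema are assumptions chosen only to show the idea.

```python
# Hypothetical per-source mappings from raw field names to one target schema.
FIELD_MAPS = {
    "site_a": {"title": "name", "cost": "price", "link": "url"},
    "site_b": {"product_name": "name", "price_usd": "price", "href": "url"},
}
TARGET_FIELDS = ("name", "price", "url")


def normalize(source, raw):
    """Convert one raw record from source into the target schema."""
    mapping = FIELD_MAPS[source]
    record = {field: None for field in TARGET_FIELDS}
    for raw_key, value in raw.items():
        target = mapping.get(raw_key)
        if target is not None:  # unmapped fields are dropped
            record[target] = value
    if record["price"] is not None:
        # Accept both numbers and strings such as "$1,200.50".
        cleaned = str(record["price"]).replace("$", "").replace(",", "")
        record["price"] = float(cleaned)
    return record
```

Each new source needs its own mapping entry, and each layout change on a site can break an existing one, which is why the effort grows with the number of sources.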
CHALLENGE II: OMNIPRESENCE OF AJAX ELEMENTS
AJAX and interactive web components make websites more user-friendly, but
not for crawlers!
The result?
Content is produced dynamically by scripts running in the browser, so it is
not visible to a crawler that only reads the HTML the server returns.
The problem?
To keep extracting up-to-date content, the crawler has to be maintained
manually on a regular basis. Even Google’s crawlers can find it difficult
to extract this information!
The solution?
Crawlers need a more refined approach, such as rendering pages the way a
browser does, while staying efficient and scalable.
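The sketch below illustrates why AJAX content is invisible to a crawler that only reads static HTML. The page, the /api/items path, and the JSON body are all invented for this example. In this made-up case the data the script would request is plain JSON, and reading it directly is one possible workaround, not a general solution.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the listing is filled in by a script after load.
STATIC_HTML = """
<html><body>
  <div id="listing"></div>
  <script>fetch("/api/items").then(render)</script>
</body></html>
"""
# Hypothetical body of the response the script above would request.
API_RESPONSE = '{"items": [{"name": "Parachute A"}, {"name": "Parachute B"}]}'


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Text a crawler sees if it never runs the page's scripts."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.chunks


def items_from_api(body):
    """Item names read from the JSON payload instead of the HTML."""
    return [item["name"] for item in json.loads(body)["items"]]
```

The static HTML yields no visible text at all, while the JSON payload contains both item names.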
CHALLENGE III: THE “REAL” REAL-TIME LATENCY
Acquiring datasets in real time is a huge problem! Real-time data is
critical in security and intelligence to predict, report, and enable
preemptive action against untoward incidents.
The problem?
The real problem lies in deciding, in real time, what is and isn't
important.
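Deciding what is important can be pictured as scoring incoming items and handling the highest-scoring ones first. This is a minimal sketch using a priority queue. The urgency rule (counting terms from ALERT_TERMS and adding a bonus for trusted sources) is a made-up assumption, and choosing a good rule is exactly the hard part the slide describes.

```python
import heapq

# Hypothetical terms that make an item look urgent.
ALERT_TERMS = {"outage", "breach", "alert"}


def urgency(item):
    """Toy score: number of alert terms, plus 1 for a trusted source."""
    words = set(item["text"].lower().split())
    return len(words & ALERT_TERMS) + (1 if item["source_trusted"] else 0)


def prioritize(items, score):
    """Yield items from highest to lowest score; ties keep arrival order."""
    # heapq is a min-heap, so scores are negated; the index breaks ties.
    heap = [(-score(item), i, item) for i, item in enumerate(items)]
    heapq.heapify(heap)
    while heap:
        _, _, item = heapq.heappop(heap)
        yield item


incoming = [
    {"id": 1, "text": "weekly newsletter", "source_trusted": True},
    {"id": 2, "text": "breach alert at vendor", "source_trusted": True},
    {"id": 3, "text": "possible outage", "source_trusted": False},
]
```

With this toy rule, item 2 is handled first, followed by items 1 and 3, which tie and keep their arrival order.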
CHALLENGE IV: WHO OWNS UGC?
Ownership of User-Generated Content (UGC) is claimed by giants like
Craigslist and Yelp, and that content is usually out of bounds for
commercial crawlers.
The result?
Only 2-3% of sites disallow bots. The others believe in data democratization,
but they may follow suit and shut off access to the data gold mine!
The problem?
Sites policing web scraping and rejecting bots.
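One common way sites tell bots what is off limits is a robots.txt file. Here is a minimal sketch of checking it with Python's standard urllib.robotparser module before fetching a page. The robots.txt content, the bot names, and the example.test URLs are invented for the example, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one named bot is rejected entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def make_checker(robots_txt):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
allowed_public = checker.can_fetch("MyCrawler", "http://example.test/public/page")
allowed_private = checker.can_fetch("MyCrawler", "http://example.test/private/data")
allowed_badbot = checker.can_fetch("BadBot", "http://example.test/public/page")
```

A polite crawler would skip any URL for which can_fetch returns False. Note that robots.txt is only a request: it does not by itself enforce anything, which is why sites also police scraping in other ways.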
THANK YOU!
