THE CHALLENGES IN CRAWLING THE WEB.
As an ever-evolving field, extracting data from the web is still a
gray area.
No clear ground rules regarding the legality of web scraping
exist!
Concern over the privacy of data collected from the Web is
growing.
People are wary about how data is or can be used.
Increasingly, Big Data is being frowned upon.
Its harvesting, even more so!
Yet, undeniably, data crawling is growing exponentially.
As it grows, the Web is gradually becoming more
complicated to crawl.
CHALLENGE I
NON-UNIFORM STRUCTURES
Data formats & structures are inconsistent in the Web space.
Norms on how to build an Internet presence are non-existent.
The result?
A lack of uniformity across the vast, ever-changing
terrain of the Internet.
The problem?
Collecting data in a machine-readable format
becomes difficult.
The problems grow with scale!
Especially when:
a) structured data is needed, and
b) a large number of details must be extracted from
multiple sources.
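One common way to cope with non-uniform sources is to map each site's field names onto a single schema. The sketch below is only an illustration: the alias table, the field names, and the sample records are hypothetical and do not come from any particular site.

```python
from typing import Any

# Hypothetical aliases: each source may name the same detail differently.
FIELD_ALIASES = {
    "title": ("title", "name", "product_name"),
    "price": ("price", "cost", "amount"),
}


def normalize(record: dict[str, Any]) -> dict[str, Any]:
    """Map one source-specific record onto a common schema."""
    out: dict[str, Any] = {}
    for field, aliases in FIELD_ALIASES.items():
        # Use the first alias present in the record; None if none match.
        out[field] = next((record[a] for a in aliases if a in record), None)
    return out


if __name__ == "__main__":
    site_a = {"name": "Kettle", "cost": "19.99"}           # made-up record
    site_b = {"product_name": "Kettle", "price": "21.50"}  # made-up record
    print(normalize(site_a))
    print(normalize(site_b))
```

Every new source then needs only a few extra aliases rather than a new parser, although real sites usually also need per-source cleaning of units and formats.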
CHALLENGE II
OMNIPRESENCE OF AJAX ELEMENTS
AJAX and interactive web components make websites more
user-friendly. But not for crawlers!
The result?
Content is produced dynamically (and on the go) by the
browser and is therefore not visible to crawlers that only fetch the raw HTML.
The problem?
To keep the content up-to-date, the crawler needs to be
maintained manually on a regular basis.
Even Google’s crawlers find it difficult to extract information!
Crawlers need to be refined in their approach to be more
efficient and scalable. We have a solution that makes crawling
AJAX pages prompt.
CHALLENGE III
THE “REAL” REAL-TIME LATENCY
Acquiring data-sets in real-time is a huge problem! Real-time
data is critical in security and intelligence to predict, report,
and enable preemptive actions.
The result?
While near-real-time delivery is achievable, true real-time
latency remains the Holy Grail.
The problem?
The real problem comes in deciding
what is and isn't important in real time.
CHALLENGE IV
WHO OWNS UGC?
User-Generated Content (UGC) proprietorship is claimed by
giants like Craigslist and Yelp and is usually out-of-bounds for
commercial crawlers.
The result?
Only 2-3% of sites disallow bots. Others believe in data
democratization, but they too may follow suit
and shut off access to the data gold mine!
The problem?
Site policing for web scraping and rejecting bots.
CHALLENGE V
THE RISE OF ANTI-SCRAPING TOOLS
Tools like ScrapeDefender, ScrapeShield, and ScrapeSentry are
capable of differentiating bots from humans.
The result?
Restrictions on web crawlers via e-mail obfuscation,
real-time monitoring, instant alerts, and so on.
The problem?
Fewer than 1% of sites use such tools today, yet that share may
rise, thanks to rogue crawlers that hit target servers repeatedly.
DDoS becomes unavoidable!
Web data is a vast uncharted territory full of bounty, and
having the proper tools helps.
So does knowing how to use them, since there is a very
thin line between ‘crawlers’ and ‘hackers’.
And this is where the genuine concern for privacy arises.
At PromptCloud, these crawling challenges are met head-on.
Here are two ground rules we recommend that every web-crawling
solution follow.
COURTESY
In our experience, a little courtesy goes a long way.
Burdening small servers and causing DDoS on target sites is
easy.
Yet it is detrimental to the success of any company – especially
small businesses!
Rule #1 is to allow an interval of at least 2 seconds between
successive requests.
This helps avoid hitting servers too hard.
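A minimal sketch of how the 2-second rule might be enforced in a crawler. The `Throttle` class and its parameters are illustrative names, not part of any library; the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to one host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds, per Rule #1
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the previous request

    def wait(self, host):
        """Sleep if the host was hit too recently; return seconds slept."""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment the next request is about to be sent.
        self._last[host] = self._clock()
        return delay
```

Calling `throttle.wait(host)` before every request keeps successive hits on the same host at least `min_interval` seconds apart, while requests to different hosts are not delayed by one another.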
CRAWLABILITY
Most websites restrict the amount of data (either
sections of the site or complete sites) that can be crawled by
spiders via the robots.txt file.
Rule #2 is to establish the feasibility of crawling such site(s)!
It helps greatly to check the site’s policy on bots: whether it
allows bots in the target sections from which data is desired.
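The check can be sketched with Python's standard `urllib.robotparser` module. The robots.txt text, the `example-bot` user agent, and the example.com URLs below are made up for illustration; a real crawler would typically load the live file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public/page", "/private/data"):
        url = "https://example.com" + path
        print(url, allowed(EXAMPLE_ROBOTS, "example-bot", url))
```

Running this check before queuing a URL keeps the crawler out of sections the site has closed to bots.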
The Challenges in Crawling the Web

More Related Content

What's hot

What is Web3 0
What is Web3 0What is Web3 0
What is Web3 0
stacytai
 
Web 4.0 and beyond
Web 4.0 and beyondWeb 4.0 and beyond
Web 4.0 and beyondJohan Koren
 
Semantic web 3.0 - Direction First
Semantic web 3.0 - Direction FirstSemantic web 3.0 - Direction First
Semantic web 3.0 - Direction FirstErica van Lieven
 
Web 3.0 and What It Means to Marketing
Web 3.0 and What It Means to MarketingWeb 3.0 and What It Means to Marketing
Web 3.0 and What It Means to Marketing
Magic Logix
 
Microsoft Power Point Lib1 #1262264 V1 Social Networking
Microsoft Power Point   Lib1 #1262264 V1 Social NetworkingMicrosoft Power Point   Lib1 #1262264 V1 Social Networking
Microsoft Power Point Lib1 #1262264 V1 Social Networking
tmdomish
 
Privacy of facebook
Privacy of facebookPrivacy of facebook
Privacy of facebookhernan_j1
 
Web 4.0 and beyond?
Web 4.0 and beyond?Web 4.0 and beyond?
Web 4.0 and beyond?Johan Koren
 
What is Web 3.0?
What is Web 3.0?What is Web 3.0?
What is Web 3.0?Johan Koren
 
Web 1 2 3
Web 1 2 3Web 1 2 3
Web 1 2 3
londoncall
 
Comvigo IM Lock WhitePaper
Comvigo IM Lock WhitePaperComvigo IM Lock WhitePaper
Comvigo IM Lock WhitePaper
James Tanner
 
Amaxus con webdoc_10773
Amaxus con webdoc_10773Amaxus con webdoc_10773
Amaxus con webdoc_10773vafopoulos
 
Online privacy concerns (and what we can do about it)
Online privacy concerns (and what we can do about it)Online privacy concerns (and what we can do about it)
Online privacy concerns (and what we can do about it)
Phil Cryer
 
Web 3.0 :The Evolution of Web
Web 3.0:The Evolution of WebWeb 3.0:The Evolution of Web
Web 3.0 :The Evolution of Web
Niharjyoti Sarangi
 
Generations of web 1.0, 2.0 and 3.0
Generations of web 1.0, 2.0 and 3.0Generations of web 1.0, 2.0 and 3.0
Generations of web 1.0, 2.0 and 3.0
ShamsReza2
 
Web 4.0 and beyond
Web 4.0 and beyondWeb 4.0 and beyond
Web 4.0 and beyondJohan Koren
 
Internet privacy presentation
Internet privacy presentationInternet privacy presentation
Internet privacy presentationMatthew Momney
 
Document of presentation(web 3.0)(part 2)
Document of presentation(web 3.0)(part 2)Document of presentation(web 3.0)(part 2)
Document of presentation(web 3.0)(part 2)
Abhishek Roy
 
What is Web 3.0?
What is Web 3.0?What is Web 3.0?
What is Web 3.0?Johan Koren
 

What's hot (20)

What is Web3 0
What is Web3 0What is Web3 0
What is Web3 0
 
Web 4.0 and beyond
Web 4.0 and beyondWeb 4.0 and beyond
Web 4.0 and beyond
 
Semantic web 3.0 - Direction First
Semantic web 3.0 - Direction FirstSemantic web 3.0 - Direction First
Semantic web 3.0 - Direction First
 
Web 3.0 and What It Means to Marketing
Web 3.0 and What It Means to MarketingWeb 3.0 and What It Means to Marketing
Web 3.0 and What It Means to Marketing
 
Microsoft Power Point Lib1 #1262264 V1 Social Networking
Microsoft Power Point   Lib1 #1262264 V1 Social NetworkingMicrosoft Power Point   Lib1 #1262264 V1 Social Networking
Microsoft Power Point Lib1 #1262264 V1 Social Networking
 
Privacy of facebook
Privacy of facebookPrivacy of facebook
Privacy of facebook
 
Web 4.0 and beyond?
Web 4.0 and beyond?Web 4.0 and beyond?
Web 4.0 and beyond?
 
What is Web 3.0?
What is Web 3.0?What is Web 3.0?
What is Web 3.0?
 
Web 1 2 3
Web 1 2 3Web 1 2 3
Web 1 2 3
 
Comvigo IM Lock WhitePaper
Comvigo IM Lock WhitePaperComvigo IM Lock WhitePaper
Comvigo IM Lock WhitePaper
 
Amaxus con webdoc_10773
Amaxus con webdoc_10773Amaxus con webdoc_10773
Amaxus con webdoc_10773
 
Online privacy concerns (and what we can do about it)
Online privacy concerns (and what we can do about it)Online privacy concerns (and what we can do about it)
Online privacy concerns (and what we can do about it)
 
Web 3.0?
Web 3.0?Web 3.0?
Web 3.0?
 
Web 3.0 :The Evolution of Web
Web 3.0:The Evolution of WebWeb 3.0:The Evolution of Web
Web 3.0 :The Evolution of Web
 
Generations of web 1.0, 2.0 and 3.0
Generations of web 1.0, 2.0 and 3.0Generations of web 1.0, 2.0 and 3.0
Generations of web 1.0, 2.0 and 3.0
 
Web 4.0 and beyond
Web 4.0 and beyondWeb 4.0 and beyond
Web 4.0 and beyond
 
Internet privacy presentation
Internet privacy presentationInternet privacy presentation
Internet privacy presentation
 
Document of presentation(web 3.0)(part 2)
Document of presentation(web 3.0)(part 2)Document of presentation(web 3.0)(part 2)
Document of presentation(web 3.0)(part 2)
 
What is Web 3.0?
What is Web 3.0?What is Web 3.0?
What is Web 3.0?
 
The future of the web 4.0: the odyssey
The future of the web 4.0: the odyssey The future of the web 4.0: the odyssey
The future of the web 4.0: the odyssey
 

Viewers also liked

The Stellar Science 2.0 Mash-UP Infrastructure
The Stellar Science 2.0 Mash-UP InfrastructureThe Stellar Science 2.0 Mash-UP Infrastructure
The Stellar Science 2.0 Mash-UP Infrastructure
Thomas Ullmann
 
Security and Trust in social media networks
Security and Trust in social media networksSecurity and Trust in social media networks
Security and Trust in social media networks
Touradj Ebrahimi
 
Best Practices In Terminology Research 2010
Best Practices In Terminology Research 2010Best Practices In Terminology Research 2010
Best Practices In Terminology Research 2010
Valentini Mellas
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
Steve Watt
 
Web2.0 Applications
Web2.0 ApplicationsWeb2.0 Applications
Web2.0 Applications
domenico79
 
Web Credibility - BJ Fogg - Stanford University
Web Credibility - BJ Fogg - Stanford UniversityWeb Credibility - BJ Fogg - Stanford University
Web Credibility - BJ Fogg - Stanford University
BJ Fogg
 

Viewers also liked (6)

The Stellar Science 2.0 Mash-UP Infrastructure
The Stellar Science 2.0 Mash-UP InfrastructureThe Stellar Science 2.0 Mash-UP Infrastructure
The Stellar Science 2.0 Mash-UP Infrastructure
 
Security and Trust in social media networks
Security and Trust in social media networksSecurity and Trust in social media networks
Security and Trust in social media networks
 
Best Practices In Terminology Research 2010
Best Practices In Terminology Research 2010Best Practices In Terminology Research 2010
Best Practices In Terminology Research 2010
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Web2.0 Applications
Web2.0 ApplicationsWeb2.0 Applications
Web2.0 Applications
 
Web Credibility - BJ Fogg - Stanford University
Web Credibility - BJ Fogg - Stanford UniversityWeb Credibility - BJ Fogg - Stanford University
Web Credibility - BJ Fogg - Stanford University
 

Similar to The Challenges in Crawling the Web

A Comprehensive Guide to Web 3.0 Development Companies.
A Comprehensive Guide to Web 3.0 Development Companies.A Comprehensive Guide to Web 3.0 Development Companies.
A Comprehensive Guide to Web 3.0 Development Companies.
Techugo
 
Challenges and Risks of Web 3.0 — A New Digital World Order
Challenges and Risks of Web 3.0 — A New Digital World OrderChallenges and Risks of Web 3.0 — A New Digital World Order
Challenges and Risks of Web 3.0 — A New Digital World Order
Mindfire LLC
 
The Development Of Web 3
The Development Of Web 3The Development Of Web 3
The Development Of Web 3
Marnusharris
 
Countering Cyber Threats By Monitoring “Normal” Website Behavior
Countering Cyber Threats By Monitoring “Normal” Website BehaviorCountering Cyber Threats By Monitoring “Normal” Website Behavior
Countering Cyber Threats By Monitoring “Normal” Website Behavior
EMC
 
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
Property Portal Watch
 
Basic computer courses in Ambla Cantt! Batra Computer Centre
Basic  computer  courses in Ambla Cantt! Batra Computer CentreBasic  computer  courses in Ambla Cantt! Batra Computer Centre
Basic computer courses in Ambla Cantt! Batra Computer Centre
Simran Grover
 
WEB 3.0 The Decentralized Web.pptx
WEB 3.0 The Decentralized Web.pptxWEB 3.0 The Decentralized Web.pptx
WEB 3.0 The Decentralized Web.pptx
Udoy Hasan
 
The Whys and Wherefores of Web Security – by United Security Providers
The Whys and Wherefores of Web Security – by United Security ProvidersThe Whys and Wherefores of Web Security – by United Security Providers
The Whys and Wherefores of Web Security – by United Security Providers
United Security Providers AG
 
Five Network Security Threats And How To Protect Your Business Wp101112
Five Network Security Threats And How To Protect Your Business Wp101112Five Network Security Threats And How To Protect Your Business Wp101112
Five Network Security Threats And How To Protect Your Business Wp101112Erik Ginalick
 
5 network-security-threats
5 network-security-threats5 network-security-threats
5 network-security-threatsReadWrite
 
Is web 3 an overengineered solution
Is web 3 an overengineered solutionIs web 3 an overengineered solution
Is web 3 an overengineered solution
Bellaj Badr
 
Workshop: Open Data - What's the Point?
Workshop: Open Data - What's the Point?Workshop: Open Data - What's the Point?
Workshop: Open Data - What's the Point?BPCW10
 
Web 3.0
Web 3.0Web 3.0
Web 3.0
Rajashree Rao
 
Updated Cyber Security and Fraud Prevention Tools Tactics
Updated Cyber Security and Fraud Prevention Tools TacticsUpdated Cyber Security and Fraud Prevention Tools Tactics
Updated Cyber Security and Fraud Prevention Tools TacticsBen Graybar
 
Is web scraping legal or not?
Is web scraping legal or not?Is web scraping legal or not?
Is web scraping legal or not?
Aparna Sharma
 
Cyber Crime and Security
Cyber Crime and SecurityCyber Crime and Security
Cyber Crime and Security
Md Nishad
 
Web 3.0 All the basics of the hype for beginners.pdf
Web 3.0 All the basics of the hype for beginners.pdfWeb 3.0 All the basics of the hype for beginners.pdf
Web 3.0 All the basics of the hype for beginners.pdf
James Brown
 
3D Internet Report
3D Internet Report3D Internet Report
3D Internet Report
maham4569
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicNana Kwame(Emeritus) Gyamfi
 
Info Session on Cybersecurity & Cybersecurity Study Jams
Info Session on Cybersecurity & Cybersecurity Study JamsInfo Session on Cybersecurity & Cybersecurity Study Jams
Info Session on Cybersecurity & Cybersecurity Study Jams
GDSCCVR
 

Similar to The Challenges in Crawling the Web (20)

A Comprehensive Guide to Web 3.0 Development Companies.
A Comprehensive Guide to Web 3.0 Development Companies.A Comprehensive Guide to Web 3.0 Development Companies.
A Comprehensive Guide to Web 3.0 Development Companies.
 
Challenges and Risks of Web 3.0 — A New Digital World Order
Challenges and Risks of Web 3.0 — A New Digital World OrderChallenges and Risks of Web 3.0 — A New Digital World Order
Challenges and Risks of Web 3.0 — A New Digital World Order
 
The Development Of Web 3
The Development Of Web 3The Development Of Web 3
The Development Of Web 3
 
Countering Cyber Threats By Monitoring “Normal” Website Behavior
Countering Cyber Threats By Monitoring “Normal” Website BehaviorCountering Cyber Threats By Monitoring “Normal” Website Behavior
Countering Cyber Threats By Monitoring “Normal” Website Behavior
 
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
Distil Network Sponsor Presentation at the Property Portal Watch Conference -...
 
Basic computer courses in Ambla Cantt! Batra Computer Centre
Basic  computer  courses in Ambla Cantt! Batra Computer CentreBasic  computer  courses in Ambla Cantt! Batra Computer Centre
Basic computer courses in Ambla Cantt! Batra Computer Centre
 
WEB 3.0 The Decentralized Web.pptx
WEB 3.0 The Decentralized Web.pptxWEB 3.0 The Decentralized Web.pptx
WEB 3.0 The Decentralized Web.pptx
 
The Whys and Wherefores of Web Security – by United Security Providers
The Whys and Wherefores of Web Security – by United Security ProvidersThe Whys and Wherefores of Web Security – by United Security Providers
The Whys and Wherefores of Web Security – by United Security Providers
 
Five Network Security Threats And How To Protect Your Business Wp101112
Five Network Security Threats And How To Protect Your Business Wp101112Five Network Security Threats And How To Protect Your Business Wp101112
Five Network Security Threats And How To Protect Your Business Wp101112
 
5 network-security-threats
5 network-security-threats5 network-security-threats
5 network-security-threats
 
Is web 3 an overengineered solution
Is web 3 an overengineered solutionIs web 3 an overengineered solution
Is web 3 an overengineered solution
 
Workshop: Open Data - What's the Point?
Workshop: Open Data - What's the Point?Workshop: Open Data - What's the Point?
Workshop: Open Data - What's the Point?
 
Web 3.0
Web 3.0Web 3.0
Web 3.0
 
Updated Cyber Security and Fraud Prevention Tools Tactics
Updated Cyber Security and Fraud Prevention Tools TacticsUpdated Cyber Security and Fraud Prevention Tools Tactics
Updated Cyber Security and Fraud Prevention Tools Tactics
 
Is web scraping legal or not?
Is web scraping legal or not?Is web scraping legal or not?
Is web scraping legal or not?
 
Cyber Crime and Security
Cyber Crime and SecurityCyber Crime and Security
Cyber Crime and Security
 
Web 3.0 All the basics of the hype for beginners.pdf
Web 3.0 All the basics of the hype for beginners.pdfWeb 3.0 All the basics of the hype for beginners.pdf
Web 3.0 All the basics of the hype for beginners.pdf
 
3D Internet Report
3D Internet Report3D Internet Report
3D Internet Report
 
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-LogicSecurity-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
Security-Challenges-in-Implementing-Semantic-Web-Unifying-Logic
 
Info Session on Cybersecurity & Cybersecurity Study Jams
Info Session on Cybersecurity & Cybersecurity Study JamsInfo Session on Cybersecurity & Cybersecurity Study Jams
Info Session on Cybersecurity & Cybersecurity Study Jams
 

More from PromptCloud

Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021
PromptCloud
 
All You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdfAll You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdf
PromptCloud
 
Web Scraping Myths vs. Facts
Web Scraping Myths vs. FactsWeb Scraping Myths vs. Facts
Web Scraping Myths vs. Facts
PromptCloud
 
Octoparse competitors.pdf
Octoparse competitors.pdfOctoparse competitors.pdf
Octoparse competitors.pdf
PromptCloud
 
Parsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptxParsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptx
PromptCloud
 
Product Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptxProduct Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptx
PromptCloud
 
Data Trends in Fashion Industry
Data Trends in Fashion IndustryData Trends in Fashion Industry
Data Trends in Fashion Industry
PromptCloud
 
Data Standardization with Web Data Integration
Data Standardization with Web Data Integration Data Standardization with Web Data Integration
Data Standardization with Web Data Integration
PromptCloud
 
Visualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe MoviesVisualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe Movies
PromptCloud
 
15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track
PromptCloud
 
Top Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce PlayersTop Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce Players
PromptCloud
 
The Birth of a Web Crawling Bot
The Birth of a Web Crawling BotThe Birth of a Web Crawling Bot
The Birth of a Web Crawling Bot
PromptCloud
 
Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019
PromptCloud
 
Zipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailersZipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailers
PromptCloud
 
Analyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday SongsAnalyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday Songs
PromptCloud
 
PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019
PromptCloud
 
Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019
PromptCloud
 
10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping
PromptCloud
 
How Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate MarketersHow Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate Marketers
PromptCloud
 
Hotel Review Data Analysis
Hotel Review Data AnalysisHotel Review Data Analysis
Hotel Review Data Analysis
PromptCloud
 

More from PromptCloud (20)

Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021
 
All You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdfAll You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdf
 
Web Scraping Myths vs. Facts
Web Scraping Myths vs. FactsWeb Scraping Myths vs. Facts
Web Scraping Myths vs. Facts
 
Octoparse competitors.pdf
Octoparse competitors.pdfOctoparse competitors.pdf
Octoparse competitors.pdf
 
Parsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptxParsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptx
 
Product Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptxProduct Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptx
 
Data Trends in Fashion Industry
Data Trends in Fashion IndustryData Trends in Fashion Industry
Data Trends in Fashion Industry
 
Data Standardization with Web Data Integration
Data Standardization with Web Data Integration Data Standardization with Web Data Integration
Data Standardization with Web Data Integration
 
Visualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe MoviesVisualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe Movies
 
15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track
 
Top Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce PlayersTop Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce Players
 
The Birth of a Web Crawling Bot
The Birth of a Web Crawling BotThe Birth of a Web Crawling Bot
The Birth of a Web Crawling Bot
 
Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019
 
Zipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailersZipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailers
 
Analyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday SongsAnalyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday Songs
 
PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019
 
Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019
 
10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping
 
How Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate MarketersHow Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate Marketers
 
Hotel Review Data Analysis
Hotel Review Data AnalysisHotel Review Data Analysis
Hotel Review Data Analysis
 

Recently uploaded

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 

Recently uploaded (20)

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
The Challenges in Crawling the Web

  • 1.
  • 2. THE CHALLENGES IN CRAWLING THE WEB.
  • 3. In this ever-evolving field, extracting data from the web is still a gray area. No clear ground rules exist regarding the legality of web scraping! Concern over privacy issues in collecting data off the Web is growing. People are wary about how data is or can be used.
  • 4. Increasingly, Big Data is being frowned upon. Its harvesting, even more so! Yet, undeniably, data crawling is growing exponentially. As it grows, the Web is gradually becoming more complicated to crawl.
  • 5. CHALLENGE I NON-UNIFORM STRUCTURES Data formats & structures are inconsistent in the Web space. Norms on how to build an Internet presence are non-existent. The result? Lack of uniformity and the vast ever-changing terrains of the Internet. The problem? Collecting data in a machine-readable format becomes difficult.
  • 6. Problems increase with scale! Especially when: a) structured data is needed, and b) a large number of details are to be extracted from multiple sources.
  • 7. CHALLENGE II OMNIPRESENCE OF AJAX ELEMENTS AJAX and interactive web components make websites more user-friendly. But not for crawlers! The result? Content is produced dynamically (and on-the-go) by the browser and therefore not visible to crawlers. The problem? To keep the content up-to-date, the crawler needs to be maintained manually on a regular basis. Even Google’s crawlers find it difficult to extract information!
  • 8. Crawlers need to be refined in their approach to be more efficient and scalable. We have a solution that makes crawling AJAX pages prompt.
  • 9. CHALLENGE III THE “REAL” REAL-TIME LATENCY Acquiring data-sets in real-time is a huge problem! Real-time data is critical in security and intelligence to predict, report, and enable preemptive actions. The result? While near-real-time is achieved, real-time latency remains the Holy Grail. The problem? The real problem comes in deciding what is and isn't important in real time.
  • 10. CHALLENGE IV WHO OWNS UGC? Proprietorship of User-Generated Content (UGC) is claimed by giants like Craigslist and Yelp, and such content is usually out of bounds for commercial crawlers. The result? Only 2-3% of sites disallow bots. Others believe in data democratization, but it is possible these may follow suit and shut access to the data gold mine! The problem? Site policing for web scraping and rejecting bots.
  • 11. CHALLENGE V THE RISE OF ANTI-SCRAPING TOOLS Tools like ScrapeDefender, ScrapeShield, and ScrapeSentry are capable of differentiating bots from humans. The result? Restrictions on web crawlers via e-mail obfuscation, real-time monitoring, instant alerts, etc. The problem? This is <1% for now, yet it may rise, all thanks to rogue crawlers responsible for multiple hits on target servers. DDoS becomes unavoidable!
  • 12. Web data is a vast uncharted territory full of bounty, and having the proper tools helps. So does knowing how to use them, since there is a very thin line between ‘crawlers’ and ‘hackers’. And this is where the genuine concern for privacy arises. At PromptCloud, these crawling challenges are met head-on. We recommend two ground rules that every web-crawling solution should follow.
  • 13. COURTESY In our experience, a little courtesy goes a long way. Burdening small servers and causing DDoS on target sites is easy. Yet it is detrimental to the success of any company, especially small businesses! Rule #1 is to allow an interval of at least 2 seconds between successive requests. This helps avoid hitting servers too hard.
  • 14. CRAWLABILITY Many (indeed most) websites use the robots.txt file to restrict how much of the site (certain sections or the complete site) can be crawled by spiders. Rule #2 is to establish the feasibility of crawling such site(s)! It helps greatly to check the site’s policy on bots: whether it allows bots in the target sections from which data is desired.
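
The two ground rules above (a gap of at least 2 seconds between requests, and a robots.txt check before crawling) could be combined roughly as in the sketch below. This is only an illustration, not PromptCloud's implementation: the class name `PoliteGate`, the user agent `example-bot`, and the sample robots.txt rules are assumptions made for the example. The robots.txt parsing relies on Python's standard `urllib.robotparser` module.

```python
import time
from urllib.robotparser import RobotFileParser

MIN_INTERVAL_SECONDS = 2.0  # Rule #1: at least 2 seconds between requests


class PoliteGate:
    """Hypothetical helper that applies both rules before each request."""

    def __init__(self, robots_lines, user_agent,
                 min_interval=MIN_INTERVAL_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        # Rule #2: parse the site's robots.txt (passed in as a list of lines).
        self._parser = RobotFileParser()
        self._parser.parse(robots_lines)
        self._user_agent = user_agent
        self._min_interval = min_interval
        self._clock = clock      # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self._parser.can_fetch(self._user_agent, url)

    def wait_turn(self):
        """Block until min_interval has passed since the previous request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self._min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


if __name__ == "__main__":
    # Assumed sample rules; a real crawler would download the site's robots.txt.
    sample_robots = ["User-agent: *", "Disallow: /private/"]
    gate = PoliteGate(sample_robots, "example-bot")
    for url in ["https://example.com/public/page",
                "https://example.com/private/data"]:
        if gate.allowed(url):
            gate.wait_turn()
            print("would fetch", url)
        else:
            print("skipping (disallowed by robots.txt)", url)
```

The clock and sleep functions are passed in as parameters only so the waiting logic can be exercised without real delays. In normal use the defaults apply, and the actual HTTP request would go where the `print` call is.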