SlideShare a Scribd company logo
Scraping the Web with Scrapinghub
For Startups
“Getting information off the
Internet is like taking a drink
from a fire hydrant.”
– Mitchell Kapor
Who Uses Web Scraping
It is used by everyone from individuals to
multinational companies:
● Monitor your competitors’ prices by scraping
product information
● Detect fraudulent reviews and sentiment
changes by scraping product reviews
● Track online reputation by scraping social
media profiles
● Create apps that use public data
● Track SEO by scraping search engine results
Web Scraping Traffic
Scrapinghub
Our products empower our users to scrape data quickly and
effectively using open source technologies. We offer:
● A cloud-based platform to help you scale your crawlers
● A smart proxy rotator to crawl the web even faster
● Professional Services to handle web scraping and data
mining for you
● Off-the-shelf datasets so you can get data hassle-free
Scrapy
Scrapy is a web scraping framework that
gets the dirty work related to web crawling
out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
Portia
Portia is a Visual Scraping tool that lets you
get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● JavaScript dynamic content generation
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
Portia
Large Scale Infrastructure
Meet Scrapy Cloud , our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn, among others
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
Web Scraping Pitfalls
Bot Countermeasures
Websites are using increasingly sophisticated techniques to
protect against bad bots.
Unfortunately, these same technologies often prevent
harmless bots from scraping content.
Common countermeasures include:
● IP address-based bans
● JavaScript and session based counter-measures
Blocked Crawlers
Servers identify and block crawlers that continuously fire many requests
to a website.
Solution: Meet Crawlera , our smart proxy rotator for web crawlers.
● Routes requests through a pool of 50k+ IPs
● Detects, logs and handles bans
● Polite scraping: Automatically throttles requests to websites
JavaScript in Web Pages
Dynamic content generated by JavaScript is often used by websites to
render the page (SPA) or to avoid being scraped by naive crawlers.
For simple instances, you can emulate the AJAX requests in Scrapy.
For complex cases, you can use Splash
● Works through an HTTP API
● Lua Scripts simulate user interaction
● No lock-in, it’s an open source project!
Duplicate Content
The web is full of duplicate content.
Duplicate Content negatively impacts:
● Storage
● Re-crawl performance
● Quality of data
Efficient algorithms for Near Duplicate Detection, like SimHash, are
applied to estimate similarity between web pages to avoid scraping
duplicated content.
Near Duplicate Detection Uses
Compare prices of products scraped from different retailers by finding
near duplicates in a dataset:
Merge similar items to avoid duplicate entries:
Title Store Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) Acme Store 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) XYZ Electronics 559.95
Name Summary Location
Saint Fin Barre’s Cathedral Begun in 1863, the cathedral was the first major work of the
Victorian architect William Burges…
51.8944, -8.48064
St. Finbarr’s Cathedral Cork Designed by William Burges and consecrated in 1870, ... 51.894401550293, -8.48064041137695
Examples of Web Scraping Usage
Competitor Monitoring
E-commerce companies use web scraping to
monitor the price fluctuations and the ratings of
competitors:
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
We help electronics companies monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks
Monitor Resellers
Lead Generation
Mine scraped data to identify who to target in a company for your
outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
● Use data science to guess their email address
Reduce the time spent on HR tasks by creating a
select pool of applicants:
● Mine scraped data to locate candidates
● Match requisite skills and background
● Spot and rescue employees that are shopping
for a new job
Human Resources
Thank you!Thank you!
scrapinghub.com

More Related Content

What's hot

Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
Treasure Data, Inc.
 
MongoDB and Spark
MongoDB and SparkMongoDB and Spark
MongoDB and Spark
Norberto Leite
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
MongoDB
 
Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business InsightsWebinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
MongoDB
 
Capacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterCapacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB Cluster
MongoDB
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
NoSQLmatters
 
Securing Your MongoDB Deployment
Securing Your MongoDB DeploymentSecuring Your MongoDB Deployment
Securing Your MongoDB Deployment
MongoDB
 
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow ZurichHow to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
Patrick Baumgartner
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure data
Treasure Data, Inc.
 
Log File Analysis: The most powerful tool in your SEO toolkit
Log File Analysis: The most powerful tool in your SEO toolkitLog File Analysis: The most powerful tool in your SEO toolkit
Log File Analysis: The most powerful tool in your SEO toolkit
Tom Bennet
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
MongoDB
 
Fluentd - Unified logging layer
Fluentd -  Unified logging layerFluentd -  Unified logging layer
Fluentd - Unified logging layer
Treasure Data, Inc.
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
DATAVERSITY
 
MongoDB et Hadoop
MongoDB et HadoopMongoDB et Hadoop
MongoDB et Hadoop
MongoDB
 
Mongodb
MongodbMongodb
Mongodb
foliba
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
Treasure Data, Inc.
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
Pierre Baillet
 
Conceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónConceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producción
MongoDB
 
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory ComputingWebinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
MongoDB
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
ArangoDB Database
 

What's hot (20)

Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
MongoDB and Spark
MongoDB and SparkMongoDB and Spark
MongoDB and Spark
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
 
Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business InsightsWebinar: MongoDB and Hadoop - Working Together to provide Business Insights
Webinar: MongoDB and Hadoop - Working Together to provide Business Insights
 
Capacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterCapacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB Cluster
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
Securing Your MongoDB Deployment
Securing Your MongoDB DeploymentSecuring Your MongoDB Deployment
Securing Your MongoDB Deployment
 
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow ZurichHow to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure data
 
Log File Analysis: The most powerful tool in your SEO toolkit
Log File Analysis: The most powerful tool in your SEO toolkitLog File Analysis: The most powerful tool in your SEO toolkit
Log File Analysis: The most powerful tool in your SEO toolkit
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Fluentd - Unified logging layer
Fluentd -  Unified logging layerFluentd -  Unified logging layer
Fluentd - Unified logging layer
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 
MongoDB et Hadoop
MongoDB et HadoopMongoDB et Hadoop
MongoDB et Hadoop
 
Mongodb
MongodbMongodb
Mongodb
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
 
Conceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producciónConceptos básicos. Seminario web 6: Despliegue de producción
Conceptos básicos. Seminario web 6: Despliegue de producción
 
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory ComputingWebinar: Elevate Your Enterprise Architecture with In-Memory Computing
Webinar: Elevate Your Enterprise Architecture with In-Memory Computing
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
 

Viewers also liked

Paper Presentation: Data Mining User Preference in Interactive Multimedia
Paper Presentation: Data Mining User Preference in Interactive MultimediaPaper Presentation: Data Mining User Preference in Interactive Multimedia
Paper Presentation: Data Mining User Preference in Interactive Multimedia
Jeanette Howe
 
Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience required
Scrapinghub
 
AIMS Website Revamp
AIMS Website RevampAIMS Website Revamp
WEB Analytics - Data Mining - MIS - eBusiness website
WEB Analytics  - Data Mining - MIS - eBusiness website WEB Analytics  - Data Mining - MIS - eBusiness website
WEB Analytics - Data Mining - MIS - eBusiness website
Jyotindra Zaveri
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
Scrapinghub
 
HTML CSS & Javascript
HTML CSS & JavascriptHTML CSS & Javascript
HTML CSS & Javascript
David Lindkvist
 

Viewers also liked (6)

Paper Presentation: Data Mining User Preference in Interactive Multimedia
Paper Presentation: Data Mining User Preference in Interactive MultimediaPaper Presentation: Data Mining User Preference in Interactive Multimedia
Paper Presentation: Data Mining User Preference in Interactive Multimedia
 
Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience required
 
AIMS Website Revamp
AIMS Website RevampAIMS Website Revamp
AIMS Website Revamp
 
WEB Analytics - Data Mining - MIS - eBusiness website
WEB Analytics  - Data Mining - MIS - eBusiness website WEB Analytics  - Data Mining - MIS - eBusiness website
WEB Analytics - Data Mining - MIS - eBusiness website
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
HTML CSS & Javascript
HTML CSS & JavascriptHTML CSS & Javascript
HTML CSS & Javascript
 

Similar to Scrapinghub Deck for Startups

Top 17 web scraping tools for data extraction in 2022
Top 17 web scraping tools for data extraction in 2022Top 17 web scraping tools for data extraction in 2022
Top 17 web scraping tools for data extraction in 2022
Aparna Sharma
 
Top 13 web scraping tools in 2022
Top 13 web scraping tools in 2022Top 13 web scraping tools in 2022
Top 13 web scraping tools in 2022
Aparna Sharma
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
Brijesh Prajapati
 
iWeb Scraping Services, India
iWeb Scraping Services, IndiaiWeb Scraping Services, India
iWeb Scraping Services, India
iWeb Scraping Services, India
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
Dataiku
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
Lars Albertsson
 
The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018
STELIANCREANGA
 
Introduction to Big Data using AWS Services
Introduction to Big Data using AWS ServicesIntroduction to Big Data using AWS Services
Introduction to Big Data using AWS Services
Anjani Phuyal
 
OWF14 - Big Data Track : Take back control of your web tracking Go further by...
OWF14 - Big Data Track : Take back control of your web tracking Go further by...OWF14 - Big Data Track : Take back control of your web tracking Go further by...
OWF14 - Big Data Track : Take back control of your web tracking Go further by...
Paris Open Source Summit
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy Cabral
 
Large-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate GuideLarge-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate Guide
Data Scraping and Data Extraction
 
How Startups can leverage big data?
How Startups can leverage big data?How Startups can leverage big data?
How Startups can leverage big data?
Rackspace
 
Web Scraper For Online Stores .
Web Scraper For Online Stores .Web Scraper For Online Stores .
Web Scraper For Online Stores .
shreyashinde56
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Daniel Zivkovic
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
Sri Ambati
 
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
Patrick Guimonet
 
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Bastian Grimm
 
Null 1
Null 1Null 1
crawl technology saves money and time
crawl technology saves money and timecrawl technology saves money and time
crawl technology saves money and time
HashScraper Inc.
 
Door Of Internet
Door Of InternetDoor Of Internet
Door Of Internet
Kuldeep Padhiyar
 

Similar to Scrapinghub Deck for Startups (20)

Top 17 web scraping tools for data extraction in 2022
Top 17 web scraping tools for data extraction in 2022Top 17 web scraping tools for data extraction in 2022
Top 17 web scraping tools for data extraction in 2022
 
Top 13 web scraping tools in 2022
Top 13 web scraping tools in 2022Top 13 web scraping tools in 2022
Top 13 web scraping tools in 2022
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
 
iWeb Scraping Services, India
iWeb Scraping Services, IndiaiWeb Scraping Services, India
iWeb Scraping Services, India
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018
 
Introduction to Big Data using AWS Services
Introduction to Big Data using AWS ServicesIntroduction to Big Data using AWS Services
Introduction to Big Data using AWS Services
 
OWF14 - Big Data Track : Take back control of your web tracking Go further by...
OWF14 - Big Data Track : Take back control of your web tracking Go further by...OWF14 - Big Data Track : Take back control of your web tracking Go further by...
OWF14 - Big Data Track : Take back control of your web tracking Go further by...
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Large-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate GuideLarge-Scale Web Scraping: An Ultimate Guide
Large-Scale Web Scraping: An Ultimate Guide
 
How Startups can leverage big data?
How Startups can leverage big data?How Startups can leverage big data?
How Startups can leverage big data?
 
Web Scraper For Online Stores .
Web Scraper For Online Stores .Web Scraper For Online Stores .
Web Scraper For Online Stores .
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
2015-06-10 Ceus by IberianSPC - new options for SharePoint 2016 and Office 36...
 
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
 
Null 1
Null 1Null 1
Null 1
 
crawl technology saves money and time
crawl technology saves money and timecrawl technology saves money and time
crawl technology saves money and time
 
Door Of Internet
Door Of InternetDoor Of Internet
Door Of Internet
 

Recently uploaded

一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
ArshadAyub49
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
GeorgiiSteshenko
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 

Recently uploaded (20)

一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 

Scrapinghub Deck for Startups

  • 1. Scraping the Web with Scrapinghub For Startups
  • 2. “Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor
  • 3. Who Uses Web Scraping It is used by everyone from individuals to multinational companies: ● Monitor your competitors’ prices by scraping product information ● Detect fraudulent reviews and sentiment changes by scraping product reviews ● Track online reputation by scraping social media profiles ● Create apps that use public data ● Track SEO by scraping search engine results
  • 5. Scrapinghub Our products empower our users to scrape data quickly and effectively using open source technologies. We offer: ● A cloud-based platform to help you scale your crawlers ● A smart proxy rotator to crawl the web even faster ● Professional Services to handle web scraping and data mining for you ● Off-the-shelf datasets so you can get data hassle-free
  • 6. Scrapy Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way. Benefits ● No platform lock-in: Open Source ● Very popular (13k+ ★) ● Battle tested ● Highly extensible ● Great documentation
  • 7. Portia Portia is a Visual Scraping tool that lets you get data without needing to write code. Benefits ● No platform lock-in: Open Source ● JavaScript dynamic content generation ● Ideal for non-developers ● Extensible ● It’s as easy as annotating a page
  • 9. Large Scale Infrastructure Meet Scrapy Cloud , our PaaS for web crawlers: ● Scalable: Crawlers run on EC2 instances or dedicated servers ● Crawlera add-on ● Control your spiders: Command line, API or web UI ● Machine learning integration: BigML, MonkeyLearn, among others ● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
  • 10. Broad Crawls Frontera allows us to build large scale web crawlers in Python: ● Scrapy support out of the box ● Distribute and scale custom web crawlers across servers ● Crawl Frontier Framework: large scale URL prioritization logic ● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
  • 12. Bot Countermeasures Websites are using increasingly sophisticated techniques to protect against bad bots. Unfortunately, these same technologies often prevent harmless bots from scraping content. Common countermeasures include: ● IP address-based bans ● JavaScript and session based counter-measures
  • 13. Blocked Crawlers Servers identify and block crawlers that continuously fire many requests to a website. Solution: Meet Crawlera , our smart proxy rotator for web crawlers. ● Routes requests through a pool of 50k+ IPs ● Detects, logs and handles bans ● Polite scraping: Automatically throttles requests to websites
  • 14. JavaScript in Web Pages Dynamic content generated by JavaScript is often used by websites to render the page (SPA) or to avoid being scraped by naive crawlers. For simple instances, you can emulate the AJAX requests in Scrapy. For complex cases, you can use Splash ● Works through an HTTP API ● Lua Scripts simulate user interaction ● No lock-in, it’s an open source project!
  • 15. Duplicate Content The web is full of duplicate content. Duplicate Content negatively impacts: ● Storage ● Re-crawl performance ● Quality of data Efficient algorithms for Near Duplicate Detection, like SimHash, are applied to estimate similarity between web pages to avoid scraping duplicated content.
  • 16. Near Duplicate Detection Uses Compare prices of products scraped from different retailers by finding near duplicates in a dataset: Merge similar items to avoid duplicate entries: Title Store Price ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) Acme Store 599.89 Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) XYZ Electronics 559.95 Name Summary Location Saint Fin Barre’s Cathedral Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… 51.8944, -8.48064 St. Finbarr’s Cathedral Cork Designed by William Burges and consecrated in 1870, ... 51.894401550293, -8.48064041137695
  • 17. Examples of Web Scraping Usage
  • 18. Competitor Monitoring E-commerce companies use web scraping to monitor the price fluctuations and the ratings of competitors: ● Scrape online retailers ● Structure the data in a search engine or DB ● Create an interface to search for products ● Sentiment analysis for product rankings
  • 19. We help electronics companies monitor the activities of their resellers: ● Tracking and watching out for stolen goods ● Pricing agreement violations ● Customer support responses on complaints ● Product line quality checks Monitor Resellers
  • 20. Lead Generation Mine scraped data to identify who to target in a company for your outbound sales campaigns: ● Locate possible leads in your target market ● Identify the right contacts within each one ● Augment the information you already have on them ● Use data science to guess their email address
  • 21. Reduce the time spent on HR tasks by creating a select pool of applicants: ● Mine scraped data to locate candidates ● Match requisite skills and background ● Spot and rescue employees that are shopping for a new job Human Resources