Frontera: open source, large scale web
crawling framework
Alexander Sibiryakov, October 15, 2015
sibiryakov@scrapinghub.com
• Software Engineer @
Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search
quality department: social and
QA search, snippets.
• 2 years at Avast! antivirus,
research team: automatic false
positive solving, large scale
prediction of malicious
download attempts.
Hola a todos!
2
We help turn web content into useful data
• Over 2 billion requests per month (~800 per second)
• Focused crawls & Broad crawls
3
{
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
Broad crawl usages
• News analysis
• Topical crawling
• Price change monitoring
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Lead generation (extracting contact information)
• Track criminal activity & find lost persons (DARPA)
4
Task
• Crawl Spanish web to gather
statistics about hosts and their
sizes.
• Limit crawl to .es zone.
• Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on.
• Finishing condition: absence of hosts with fewer than 100 crawled documents.
• Low costs.
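The breadth-first ordering and the finishing condition can be illustrated with a small in-memory sketch. It is not Frontera code: the `crawl_breadth_first` function, the `links` dict standing in for real downloads, and the `max_docs_per_host` parameter are hypothetical names, and the finishing condition is read here as "stop requesting a host once it has 100 crawled documents".

```python
from collections import Counter, deque
from urllib.parse import urlparse


def crawl_breadth_first(seeds, links, max_docs_per_host=100, zone=".es"):
    """Visit URLs level by level: seeds first, then 1 click away, then 2, ...

    `links` maps a URL to the URLs found on that page (a stand-in for
    downloading). Hosts outside `zone` are skipped, and a host receives no
    more requests once `max_docs_per_host` of its pages were crawled.
    """
    crawled_per_host = Counter()
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        host = urlparse(url).hostname or ""
        if not host.endswith(zone):
            continue  # outside the .es zone
        if crawled_per_host[host] >= max_docs_per_host:
            continue  # this host already has enough documents
        crawled_per_host[host] += 1
        order.append((depth, url))
        for out in links.get(url, ()):
            if out not in seen:
                seen.add(out)
                queue.append((out, depth + 1))
    return order, crawled_per_host
```

Because the queue is a FIFO, every document at distance N is visited before any document at distance N+1; the crawl ends when the queue is empty.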
5
Spanish internet (.es) in 2012
• Domain names registered - 1.56M (39% growth per year)
• Web servers in zone - 283.4K (33.1%)
• Hosts - 4.2M (21%)
• Spanish web sites in DMOZ catalog - 22,043

* - OECD Communications Outlook 2013 report
6
Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning,
scalability).
• Twisted.Internet - library for async primitives for use in workers.
• Snappy - efficient compression algorithm for I/O-bound applications.
* - network operations in Scrapy are implemented asynchronously, based on the same Twisted.Internet
7
Architecture
[Diagram: Kafka topic, crawling strategy workers (SW), storage workers (DB)]
8
1. Big and small hosts
problem
• When the crawler encounters a huge number of links from a single host and a simple prioritization model is used, the queue gets flooded with URLs from that host.
• That causes underuse of spider resources.
• We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory.
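One way to picture the per-host queue with metering is the sketch below. It is an illustration under assumptions, not the algorithm used in the crawl: the `PerHostQueue` class and the `max_per_host` parameter are hypothetical. Each batch takes at most `max_per_host` URLs from any single host, and the overflow from big hosts stays in memory for later batches.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


class PerHostQueue:
    """One FIFO per host; each host contributes a bounded share of a batch."""

    def __init__(self, max_per_host=2):
        self.max_per_host = max_per_host
        self.hosts = OrderedDict()  # hostname -> deque of pending URLs

    def push(self, url):
        host = urlparse(url).hostname
        self.hosts.setdefault(host, deque()).append(url)

    def next_batch(self, size):
        batch = []
        for host in list(self.hosts):  # snapshot: the dict is reordered below
            if len(batch) >= size:
                break
            urls = self.hosts[host]
            take = min(self.max_per_host, size - len(batch), len(urls))
            for _ in range(take):
                batch.append(urls.popleft())
            if urls:
                self.hosts.move_to_end(host)  # served hosts go to the back
            else:
                del self.hosts[host]
        return batch
```

With five URLs queued for one big host and one URL each for two small hosts, a batch of four contains two URLs from the big host and one from each small host, so the spider keeps several hosts busy in parallel.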
9
2. DDoS of the Amazon AWS
DNS service
• A breadth-first strategy visits previously unknown hosts first, and therefore generates a huge number of DNS requests.
• Recursive DNS server on each
downloading node, with
upstream set to Verizon and
OpenDNS.
• We used dnsmasq.
10
3. Tuning the Scrapy thread pool
for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IP addresses.
• When an IP is absent from the cache, a request is sent to the DNS server in its own thread, which is blocking.
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for adjusting the thread pool size and timeout.
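In current Scrapy releases these knobs are available as regular project settings. The fragment below is a sketch of a `settings.py`; the setting names are Scrapy's, but the values are illustrative assumptions, not the ones used in this crawl.

```python
# settings.py fragment (values are illustrative assumptions)

# Size of the Twisted reactor thread pool, which Scrapy uses for
# blocking DNS lookups among other things.
REACTOR_THREADPOOL_MAXSIZE = 32

# Seconds to wait for a DNS answer before the lookup fails.
DNS_TIMEOUT = 120

# Keep resolved names in memory so repeated hosts do not hit the resolver.
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 100000
```

A larger pool allows more lookups in flight at once, which matters in a breadth-first crawl where most hosts are new and therefore missing from the cache.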
11
4. Overloaded HBase region
servers during state check
• The crawler extracts hundreds of links from a document on average.
• Before these links are added to the queue, they need to be checked to see whether they were already crawled (to avoid repeated visits).
• At small volumes SSDs were just fine. As the table grew, we had to move to HDDs, and response times increased dramatically.
• Host-local fingerprint function for
keys in HBase.
• Tuning HBase block cache to fit
average host states into one block.
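A host-local fingerprint can be sketched as a key whose prefix depends only on the host, so that states of URLs from one host sort next to each other in HBase and tend to land in the same block. The construction below (CRC32 of the hostname followed by SHA-1 of the URL) is an assumption made for illustration; the slides do not spell out the exact function, and `host_local_fingerprint` is a hypothetical name.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_local_fingerprint(url):
    """Return a hex row key: 8 chars derived from the host, 40 from the URL."""
    host = urlparse(url).hostname or ""
    host_part = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    url_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%08x%s" % (host_part, url_part)
```

Since HBase stores rows sorted by key, checking the state of a few hundred links extracted from one page, most of which usually point to the same host, touches a small number of neighbouring blocks instead of hundreds of random ones.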
12
5. Intensive network traffic
from workers to services
• We noticed throughput of up to 1 Gbit/s between workers, Kafka and HBase.
• Switched to Thrift compact
protocol for HBase
communication.
• Message compression in
Kafka using Snappy.
13
6. Further query and traffic
optimizations to HBase
• State checks required the lion's share of requests and network throughput.
• Consistency was another
requirement.
• We created local state cache
in strategy worker.
• For consistency, spider log
was partitioned by host, to
avoid cache overlap
between workers.
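Partitioning by host can be sketched as a function that maps every URL of a host to the same partition number, so a given host's state is only ever cached by one strategy worker. The `partition_for` function is a hypothetical illustration; the use of CRC32 here is an assumption, chosen because the queue cells in this deck already carry Crc32(hostname).

```python
import zlib
from urllib.parse import urlparse


def partition_for(url, num_partitions):
    """Pick a spider log partition from the hostname only."""
    host = urlparse(url).hostname or ""
    return (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF) % num_partitions
```

Two workers reading different partitions never see URLs from the same host, so their local caches cannot hold conflicting states for one document.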
14
State cache
• All operations are batched:
• if a key is absent from the cache, it is requested from HBase,
• every ~4K documents the cache is flushed to HBase.
• On reaching 3M elements (~1 GB), a flush and cleanup happens.
• A Least-Recently-Used (LRU) algorithm seems a good fit here.
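The behaviour described above can be sketched as a write-back cache with LRU eviction. This is a minimal illustration, not the Frontera implementation: `StateCache` is a hypothetical class, a plain dict stands in for the HBase table, and the cleanup step is implemented as LRU eviction, which the slide only suggests as a good fit.

```python
from collections import OrderedDict


class StateCache:
    def __init__(self, backend, max_size=3000000):
        self.backend = backend      # dict-like stand-in for the HBase table
        self.max_size = max_size
        self.cache = OrderedDict()  # key -> state, least recently used first
        self.dirty = set()          # keys changed since the last flush

    def fetch(self, keys):
        """Load missing keys in one batch (a real backend would multi-get)."""
        for key in keys:
            if key not in self.cache and key in self.backend:
                self.cache[key] = self.backend[key]
        self._trim()

    def get(self, key, default=None):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        return default

    def set(self, key, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        self.dirty.add(key)
        self._trim()

    def flush(self):
        """Write all changed states back to the backend."""
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()

    def _trim(self):
        """On overflow, flush first so no change is lost, then evict."""
        if len(self.cache) > self.max_size:
            self.flush()
            while len(self.cache) > self.max_size:
                self.cache.popitem(last=False)
```

A strategy worker would call `fetch` once per batch of extracted links, read and update states through `get` and `set`, and call `flush` every few thousand documents.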
15
Spider priority queue (slot)
• Cell has an array of:

- fingerprint, 

- Crc32(hostname), 

- URL, 

- score
• Dequeueing top N.
• Such a design is vulnerable to huge hosts.
• This problem can be partially solved with a scoring model that takes into account the known document count per host.
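A sketch of the queue cell, a top-N dequeue and a host-aware score follows. The damping formula in `host_aware_score` is purely an assumption for illustration (the slides do not give the scoring model), and `QueueItem` and `dequeue_top` are hypothetical names.

```python
import heapq
import math
from collections import namedtuple

# One entry of the array stored in a queue cell, as listed on the slide.
QueueItem = namedtuple("QueueItem", "fingerprint host_crc32 url score")


def host_aware_score(base_score, host_doc_count):
    """Lower the score as more documents are already known for the host."""
    return base_score / (1.0 + math.log1p(host_doc_count))


def dequeue_top(items, n):
    """Return the n items with the highest score, best first."""
    return heapq.nlargest(n, items, key=lambda item: item.score)
```

With such a score, a new page from a host with a thousand known documents ranks below an equally good page from a host seen only a few times, which slows down (but does not prevent) flooding by big hosts.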
16
7. Problem of big and small
hosts (strikes back!)
• During crawling we found a few huge hosts (>20M docs).
• All queue partitions were flooded with pages from these few huge hosts, because of the queue design and the scoring model used.
• We made two MapReduce
jobs:
• queue shuffling,
• limiting all hosts to no more
than 100 documents.
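The effect of the two jobs can be shown on an in-memory list. This is a single-process sketch of the idea only, not the MapReduce code that was run against the queue; `limit_per_host` and `shuffle_queue` are hypothetical names.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def limit_per_host(urls, limit=100):
    """Group URLs by host ("map"), keep at most `limit` per host ("reduce")."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).hostname].append(url)
    kept = []
    for host_urls in by_host.values():
        kept.extend(host_urls[:limit])
    return kept


def shuffle_queue(urls, seed=None):
    """Return the URLs in random order so hosts are spread across the queue."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```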
17
Hardware requirements
• A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel.
• Spiders to workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example:
• 12 spiders ~ 14.4K pages/min.,
• 3 SW and 3 DB workers,
• Total 18 cores.
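The sizing example can be reproduced with a few lines of arithmetic. The sketch assumes one core per process and as many DB workers as strategy workers, which is what the example on the slide implies; `estimate_cluster` is a hypothetical helper.

```python
import math

PAGES_PER_MIN_PER_SPIDER = 1200   # single-threaded Scrapy spider
SPIDERS_PER_STRATEGY_WORKER = 4   # 4:1 spiders to workers ratio


def estimate_cluster(spiders):
    strategy_workers = math.ceil(spiders / SPIDERS_PER_STRATEGY_WORKER)
    db_workers = strategy_workers  # assumption: same count, as in the example
    return {
        "pages_per_min": spiders * PAGES_PER_MIN_PER_SPIDER,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        "cores": spiders + strategy_workers + db_workers,  # one core each
    }
```

For 12 spiders this gives 14,400 pages/min., 3 SW, 3 DB workers and 18 cores, matching the figures above.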
18
Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS Service.
CDH (100% open source Hadoop package)
19
Maintaining Cloudera Hadoop on
Amazon EC2
• CDH is very sensitive to free space on the root partition, parcels, and the storage of Cloudera Manager.
• We moved them to a separate EBS partition using symbolic links.
• EBS should be at least 30 GB; base IOPS should be enough.
• Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD).
• After one week of crawling, we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).
20
Spanish (.es) internet crawl results
• The biggest websites are fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es and docentesconeducacion.es,
• 68.7K domains found (~600K
expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than
50M pages
21
Where are the rest of the
web servers?!
Bow-tie model
A. Broder et al. / Computer Networks 33 (2000) 309-320
23
Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
24
Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
25
Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase are included).
• Canonical URL resolution abstraction: each document has many URLs, which one to use?
• Scrapy ecosystem: good documentation, big community, ease of customization.
26
Distributed Frontera features
• Communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
27
References
• Distributed Frontera. https://github.com/scrapinghub/distributed-frontera
• Frontera. https://github.com/scrapinghub/frontera
• Documentation:
• http://distributed-frontera.readthedocs.org/
• http://frontera.readthedocs.org/
28
Future plans
• Lighter version, without HBase
and Kafka. Communicating
using sockets.
• Revisiting strategy out-of-box.
• Watchdog solution: tracking
website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub
services.
• Testing on larger volumes.
29
Contribute!
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• Truly resource-intensive task: CPU, network, disks.
• Made at Scrapinghub, the company where Scrapy was created.
• There are plans to become an Apache Software Foundation project.
30
We’re hiring!
http://scrapinghub.com/jobs/
31
32
Mandatory sales slide
Crawl the web, at scale
• cloud-based platform
• smart proxy rotator
Get data, hassle-free
• off-the-shelf datasets
• turn-key web scraping
try.scrapinghub.com/BDS15
Gracias!
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com

Essentials of Automations: The Art of Triggers and Actions in FME
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 

Frontera: open source, large scale web crawling framework

  • 1. Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 15, 2015 sibiryakov@scrapinghub.com
  • 2. Hola a todos! • Software Engineer @ Scrapinghub • Born in Yekaterinburg, RU • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
  • 3. We help turn web content into useful data • Over 2 billion requests per month (~800 per second) • Focused crawls & Broad crawls { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" }, "points": "9 points",
  • 4. Broad crawl usages • News analysis • Topical crawling • Price change monitoring • Sentiment analysis (popularity, likability) • Due diligence (profile/business data) • Lead generation (extracting contact information) • Track criminal activity & find lost persons (DARPA)
  • 5. Task • Crawl the Spanish web to gather statistics about hosts and their sizes. • Limit the crawl to the .es zone. • Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on. • Finishing condition: no hosts remain with fewer than 100 crawled documents. • Low costs.
  • 6. Spanish internet (.es) in 2012 • Domain names registered - 1.56M (39% growth per year) • Web servers in zone - 283.4K (33.1%) • Hosts - 4.2M (21%) • Spanish web sites in DMOZ catalog - 22043
 * - OECD Communications Outlook 2013 report
  • 7. Solution • Scrapy* - network operations. • Apache Kafka - data bus (offsets, partitioning). • Apache HBase - storage (random access, linear scanning, scalability). • Twisted.Internet - library of async primitives for use in workers. • Snappy - efficient compression algorithm for IO-bound applications. * - network operations in Scrapy are implemented asynchronously, based on the same Twisted.Internet.
  • 9. 1. Big and small hosts problem • When the crawler encounters a huge number of links from one host and uses a simple prioritization model, the queue gets flooded with URLs from that host. • That causes underuse of spider resources. • We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory.
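The per-host queue with metering can be sketched roughly as follows. This is a minimal in-memory illustration, not Frontera's actual implementation: the class name `PerHostQueue` and the `max_per_host` parameter are assumptions made for the example. Each batch takes at most a fixed number of URLs from any one host, so a big host cannot fill the whole batch.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PerHostQueue:
    """Hypothetical per-host queue: each batch is metered per host."""

    def __init__(self, max_per_host=2):
        self.max_per_host = max_per_host
        # One FIFO queue per hostname, kept in memory.
        self.queues = defaultdict(deque)

    def push(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def next_batch(self, size):
        batch = []
        for host in list(self.queues):
            if len(batch) >= size:
                break
            queue = self.queues[host]
            # Take no more than max_per_host URLs from this host.
            take = min(self.max_per_host, len(queue), size - len(batch))
            batch.extend(queue.popleft() for _ in range(take))
            if not queue:
                del self.queues[host]
        return batch
```

With five URLs queued for one big host and one URL for a small host, a batch of size 4 contains only two URLs of the big host, and the small host still gets its slot.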
  • 10. 2. DDoS of the Amazon AWS DNS service • A breadth-first strategy visits previously unknown hosts first, and therefore generates a huge number of DNS requests. • We ran a recursive DNS server on each downloading node, with upstream set to Verizon and OpenDNS. • We used dnsmasq.
  • 11. 3. Tuning the Scrapy thread pool for efficient DNS resolution • Scrapy uses a thread pool to resolve DNS names to IPs. • When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which is blocking. • Scrapy reported numerous errors related to DNS name resolution and timeouts. • We added an option to Scrapy for adjusting the thread pool size and timeout.
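In current Scrapy releases, the thread pool size and the DNS timeout are exposed as project settings. The fragment below is a sketch of a `settings.py` for a DNS-heavy broad crawl. The values are illustrative only, and it is an assumption that these settings correspond exactly to the options mentioned on the slide.

```python
# settings.py fragment (illustrative values, not taken from the talk)

# Size of the Twisted reactor thread pool that Scrapy uses for
# blocking DNS lookups. The Scrapy default is 10.
REACTOR_THREADPOOL_MAXSIZE = 30

# Timeout for a single DNS query, in seconds. The Scrapy default is 60.
DNS_TIMEOUT = 120

# Scrapy's in-memory DNS cache, so repeated lookups skip the resolver.
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
```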
  • 12. 4. Overloaded HBase region servers during state check • The crawler extracts hundreds of links from a document on average. • Before these links are added to the queue, they need to be checked to see whether they were already crawled (to avoid repeated visits). • On small volumes SSDs were just fine. After the table grew, we had to move to HDDs, and response times grew dramatically. • Host-local fingerprint function for keys in HBase. • Tuning the HBase block cache to fit the average host's states into one block.
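One way to read "host-local fingerprint" is a row key whose prefix depends only on the hostname, so that all URLs of a host sort next to each other in HBase and a state check for one page's links touches few blocks. The sketch below assumes a CRC32 host prefix followed by a SHA-1 of the full URL. The function name and the exact key layout are assumptions for illustration, not necessarily the layout Frontera uses.

```python
import hashlib
import zlib
from urllib.parse import urlsplit


def host_local_fingerprint(url):
    """Return a hex key: 8 chars of CRC32(hostname) + 40 chars of SHA-1(url)."""
    host = urlsplit(url).hostname or ""
    # Host part: identical for every URL of the same host.
    host_part = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    # URL part: distinguishes documents within the host.
    url_part = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%08x%s" % (host_part, url_part)
```

Keys for `http://a.example.es/1` and `http://a.example.es/2` share the same 8-character prefix and differ only in the URL part.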
  • 13. 5. Intensive network traffic from workers to services • We noticed throughput between workers, Kafka and HBase of up to 1 Gbit/s. • Switched to the Thrift compact protocol for HBase communication. • Message compression in Kafka using Snappy.
  • 14. 6. Further query and traffic optimizations to HBase • The state check accounted for the lion's share of requests and network throughput. • Consistency was another requirement. • We created a local state cache in the strategy worker. • For consistency, the spider log was partitioned by host, to avoid cache overlap between workers.
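Partitioning by host can be sketched as a hash of the hostname taken modulo the number of partitions, so that every URL of a given host is always handled by the same strategy worker and its cache. The function name `partition_for` and the choice of CRC32 are assumptions for this example.

```python
import zlib
from urllib.parse import urlsplit


def partition_for(url, num_partitions):
    """Map a URL to a partition using only its hostname."""
    host = urlsplit(url).hostname or ""
    return (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF) % num_partitions
```

Because the path is ignored, two pages of the same site never land in different partitions, which is what keeps the per-worker state caches from overlapping.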
  • 15. State cache • All operations are batched: • if a key is absent from the cache, it is requested from HBase, • every ~4K documents the cache is flushed to HBase. • On reaching 3M (~1 GB) elements, a flush and cleanup happens. • It seems the Least-Recently-Used (LRU) algorithm is a good fit there.
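A rough sketch of such a state cache follows, under stated assumptions: a plain dict stands in for HBase, the class and method names are hypothetical, and the cleanup step uses LRU eviction, which the slide only suggests as a good fit.

```python
from collections import OrderedDict


class StateCache:
    """Hypothetical write-back cache of URL states with LRU cleanup."""

    def __init__(self, backend, max_size=3000000):
        self.backend = backend      # dict-like stand-in for HBase
        self.max_size = max_size
        self.cache = OrderedDict()  # oldest entries first
        self.dirty = set()          # keys changed since the last flush

    def fetch(self, keys):
        """Load missing keys from the backend (one batched read in practice)."""
        for key in keys:
            if key not in self.cache and key in self.backend:
                self.cache[key] = self.backend[key]
            if key in self.cache:
                self.cache.move_to_end(key)

    def get(self, key, default=None):
        return self.cache.get(key, default)

    def set(self, key, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        self.dirty.add(key)
        if len(self.cache) > self.max_size:
            # Flush first, so that no unsaved state is evicted.
            self.flush()
            while len(self.cache) > self.max_size:
                self.cache.popitem(last=False)

    def flush(self):
        """Write all changed states back to the backend."""
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()
```

A caller would invoke `flush()` periodically (the slide mentions every ~4K documents) in addition to the automatic flush on overflow.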
  • 16. Spider priority queue (slot) • A cell holds an array of: fingerprint, Crc32(hostname), URL, score. • Dequeueing takes the top N. • Such a design is vulnerable to huge hosts. • This problem can be partially solved with a scoring model that takes into account the known document count per host.
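The queue cell and a host-aware score can be sketched as below. The damping formula and the `soft_limit` parameter are purely illustrative assumptions; the talk does not specify the scoring model. The idea is only that the more documents are already known for a host, the lower its new URLs rank.

```python
import heapq
import zlib


def host_aware_score(base_score, known_docs, soft_limit=100):
    """Hypothetical scoring: damp the score as a host's document count grows."""
    return base_score / (1.0 + known_docs / soft_limit)


def make_cell(fingerprint, hostname, url, score):
    """Build a queue cell: (fingerprint, Crc32(hostname), URL, score)."""
    host_crc = zlib.crc32(hostname.encode("utf-8")) & 0xFFFFFFFF
    return (fingerprint, host_crc, url, score)


def dequeue_top_n(cells, n):
    """Return the n cells with the highest score."""
    return heapq.nlargest(n, cells, key=lambda cell: cell[3])
```

With this damping, a URL from a host with 20M known documents and base score 1.0 ranks below a URL from a new host with base score 0.5.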
  • 17. 7. Problem of big and small hosts (strikes back!) • During crawling we found a few very huge hosts (>20M docs). • All queue partitions were flooded with pages from a few huge hosts, because of the queue design and scoring model used. • We wrote two MapReduce jobs: • queue shuffling, • limiting all hosts to no more than 100 documents.
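The effect of the two jobs can be illustrated with a small in-memory stand-in. This is not MapReduce code, and the function name and the `seed` parameter are assumptions for the example. It keeps at most `max_per_host` URLs per host and then shuffles what remains.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def shuffle_and_limit(urls, max_per_host=100, seed=None):
    """Keep at most max_per_host URLs per host, then shuffle the result."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if len(by_host[host]) < max_per_host:
            by_host[host].append(url)
    kept = [url for group in by_host.values() for url in group]
    # A seeded generator makes the shuffle reproducible for testing.
    random.Random(seed).shuffle(kept)
    return kept
```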
  • 18. Hardware requirements • A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel. • Spiders to workers ratio is 4:1 (without content). • 1 GB of RAM for every SW (strategy worker; state cache, tunable). • Example: • 12 spiders ~ 14.4K pages/min., • 3 SW and 3 DB workers, • Total 18 cores.
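The sizing example can be reproduced with a short calculation. It assumes the 4:1 ratio applies to each worker type separately and that every process occupies one core, which is how the numbers in the example add up. The function name is hypothetical.

```python
import math

PAGES_PER_SPIDER_PER_MIN = 1200  # single-threaded Scrapy spider
SPIDERS_PER_WORKER = 4           # 4:1 ratio, assumed per worker type
RAM_GB_PER_STRATEGY_WORKER = 1   # state cache, tunable


def estimate_cluster(spiders):
    """Estimate throughput and resources for a given number of spiders."""
    strategy_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    db_workers = math.ceil(spiders / SPIDERS_PER_WORKER)
    return {
        "pages_per_min": spiders * PAGES_PER_SPIDER_PER_MIN,
        "strategy_workers": strategy_workers,
        "db_workers": db_workers,
        # One core per spider and per worker process (assumption).
        "cores": spiders + strategy_workers + db_workers,
        "strategy_worker_ram_gb": strategy_workers * RAM_GB_PER_STRATEGY_WORKER,
    }
```

For 12 spiders this gives 14,400 pages per minute, 3 strategy workers, 3 DB workers and 18 cores, matching the example.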
  • 19. Software requirements • Apache HBase, • Apache Kafka, • Python 2.7+, • Scrapy 0.24+, • DNS Service. CDH (100% Open source Hadoop package)
  • 20. Maintaining Cloudera Hadoop on Amazon EC2 • CDH is very sensitive to free space on the root partition, parcels, and the storage of Cloudera Manager. • We moved them to a separate EBS partition using symbolic links. • EBS should be at least 30 GB; base IOPS should be enough. • Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB, 2x40 GB SSD). • After one week of crawling, we ran out of space, and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB, 3x2 TB HDD).
  • 21. Spanish (.es) internet crawl results • The biggest websites are fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es, • 68.7K domains found (~600K expected), • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages
  • 22. Where are the rest of the web servers?!
  • 23. Bow-tie model A. Broder et al. / Computer Networks 33 (2000) 309-320 23
  • 24. Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005 24
  • 25. Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014 25
  • 26. Main features • Online operation: scheduling of new batches, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy and HBase are included). • Canonical URL resolution abstraction: each document has many URLs, which one to use? • Scrapy ecosystem: good documentation, big community, ease of customization.
  • 27. Distributed Frontera features • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: crawling goal, URL ordering, and scoring model are coded in a separate module. • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders.
  • 28. References • Distributed Frontera. https://github.com/scrapinghub/distributed-frontera • Frontera. https://github.com/scrapinghub/frontera • Documentation: • http://distributed-frontera.readthedocs.org/ • http://frontera.readthedocs.org/
  • 29. Future plans • Lighter version, without HBase and Kafka, communicating using sockets. • Revisiting strategy out of the box. • Watchdog solution: tracking website content changes. • PageRank or HITS strategy. • Own HTML and URL parsers. • Integration into Scrapinghub services. • Testing on larger volumes.
  • 30. Contribute! • Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python. • Truly resource-intensive task: CPU, network, disks. • Made at Scrapinghub, the company where Scrapy was created. • Plans to become an Apache Software Foundation project.
  • 32. Mandatory sales slide • Crawl the web, at scale: cloud-based platform, smart proxy rotator • Get data, hassle-free: off-the-shelf datasets, turn-key web scraping • try.scrapinghub.com/BDS15