Distributed Crawler Service architecture presentation
Gennady Baranov
Distributed Crawler Service architecture. Describes the main architecture, key principles, and feature list of the Distributed Crawler software.
2. Introduction
The HCE-DC is a multipurpose, high-performance, scalable and extensible engine for web data collection and processing.
It is built on several HCE project products and technologies:
● The Distributed Tasks Manager (DTM) service.
● The hce-node network cluster application.
● The API bindings for the Python and PHP languages.
● The tools and library of crawling algorithms.
● The tools and library of scraping algorithms.
● The web administration console.
● The real-time REST API.
HCE-DC provides flexible configuration and deployment automation, so an installation can be tailored to a target project and integrated easily.
3. The main functional purposes
● Crawling: scan web sites, analyze and parse pages, detect and collect URL links and web resource data. Download resources from web servers using collected or request-provided URLs and store them in local raw-data file storage. Everything runs on a multi-host, multi-process architecture (a configuration sketch follows this list).
● Processing: process web page contents with several customizable applied algorithms, such as unstructured textual content scraping, and store results in local SQL DB storage, again on a multi-host, multi-process architecture.
● Task management: manage crawling and processing tasks with scheduling and balancing, using either the tasks management service of the multi-host architecture or the real-time, multi-threaded, load-balancing client-server architecture with a multi-host backend engine.
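To make the crawling purpose concrete, here is a minimal sketch of what a crawling project definition could look like in Python. All field names are hypothetical and only illustrate the kinds of parameters mentioned above; they are not the actual HCE-DC configuration format.

# Illustrative only: field names are hypothetical, not the actual HCE-DC schema.
crawl_project = {
    "root_urls": ["http://example.com/", "http://example.org/news/"],
    "recrawl_period_min": 60,          # periodic re-crawling interval
    "scan_depth": 3,                   # how many link hops away from the root URLs
    "limits": {
        "max_pages": 10000,
        "max_content_size_bytes": 2 * 1024 * 1024,
        "max_errors": 100,
        "allowed_content_types": ["text/html", "application/rss+xml"],
    },
    "request_delay_sec": 1.0,          # politeness delay between requests
    "obey_robots_txt": True,
}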
4. The extended functionality
● Developer's API – full access to configuration, deployment, monitoring and management of processes and data.
● Applied API – a full-featured, multi-threaded, multi-host, HTTP-based REST protocol to perform crawling and scraping batch requests (see the sketch after this list).
● Web administration console – management of the DC and DTM services and of user accounts, roles and permissions; crawling, scraping and results collection, aggregation and conversion management; statistical data collection and visualization; notification, triggering and other utility tools.
● Helper tools and libraries – several supporting applied utilities.
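As an illustration of the Applied API idea, the sketch below submits a small crawl-and-scrape batch over HTTP using Python's requests library. The endpoint URL and JSON fields are assumptions made for illustration only, not the documented HCE-DC protocol.

# A minimal sketch of a batch crawl/scrape request over HTTP REST.
# The endpoint path and JSON fields are hypothetical, not the documented HCE-DC API.
import requests

batch_request = {
    "urls": ["http://example.com/article-1.html", "http://example.com/article-2.html"],
    "crawl": True,        # fetch the pages
    "scrape": True,       # run the scraping engine on the fetched pages
    "template": "news",   # which scraping engine/template to apply
}

response = requests.post(
    "http://dc-gateway.example.local:8080/api/batch",
    json=batch_request,
    timeout=60,
)
response.raise_for_status()
for item in response.json().get("items", []):
    print(item.get("url"), item.get("title"))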
5. Distributed asynchronous nature
The HCE-DC engine is an architecturally fully distributed system. It can be deployed and configured as a single-host or multi-host installation.
Key features and properties of the distributed architecture:
● No central database or data storage for crawling and processing. Each physical host unit shards data with the same structures, yet the whole is represented as a single service.
● Crawling and processing run on several physical hosts in a parallel, multi-process way, covering downloading, fetching, DOM parsing, URL collection, field extraction, post-processing, metric calculation and similar tasks.
● Customizable strategies of data sharding and request balancing minimize data redundancy and optimize resource usage (one possible strategy is sketched after this list).
● Pages and scraped contents are reduced internally with smart merging, avoiding resource duplicates in the fetched-data client response.
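One way such a sharding strategy could look is sketched below: the site host is hashed so that all pages of one site land on the same data host, which keeps site-level data together and avoids duplicating it across shards. This is an assumption for illustration, not the actual HCE-DC sharding implementation; the host names are hypothetical.

# Sketch of a host-based sharding strategy (illustrative assumption).
import hashlib
from urllib.parse import urlparse

DATA_HOSTS = ["dc-data-01", "dc-data-02", "dc-data-03"]  # hypothetical host names

def shard_for_url(url: str) -> str:
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return DATA_HOSTS[int(digest, 16) % len(DATA_HOSTS)]

print(shard_for_url("http://example.com/page.html"))   # deterministic shard choice
print(shard_for_url("http://example.com/other.html"))  # -> same data host as above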
6. Flexible balancing and scalability
The HCE-DC service can be deployed on a set of physical hosts.
The number of hosts depends on their hardware productivity rate
(CPU core count, disk space, network interface speed and so on) and
can range from one to tens or more. Key scalability principles are:
– The hardware computational unit is a physical or logical host (any kind of virtualization or container is supported).
– Hardware units can be added to the system and gradually filled with data during regular crawling iterations at run-time; no dedicated data migration is needed.
– Computational task balancing is optimized for resource usage. The task scheduler selects the computational unit with the maximum free system resources using a customizable estimation formula (a sketch follows below). Different system resource usage indicators are available: CPU, RAM, disk, I/O wait, process count, thread count and so on.
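A minimal sketch of such an estimation formula is given below, assuming a weighted sum of free-resource indicators. The weights, indicator names and unit names are hypothetical, not the actual HCE-DC scheduler configuration.

# Sketch of a customizable resource-estimation formula (weights and fields are
# assumptions for illustration).
WEIGHTS = {"cpu_idle": 0.4, "ram_free": 0.3, "disk_free": 0.2, "io_wait": -0.1}

def host_score(indicators: dict) -> float:
    """Higher score = more free resources; I/O wait contributes negatively."""
    return sum(WEIGHTS[name] * indicators.get(name, 0.0) for name in WEIGHTS)

hosts = {
    "unit-a": {"cpu_idle": 0.70, "ram_free": 0.50, "disk_free": 0.80, "io_wait": 0.02},
    "unit-b": {"cpu_idle": 0.20, "ram_free": 0.60, "disk_free": 0.90, "io_wait": 0.15},
}

# The scheduler would dispatch the next task to the best-scoring computational unit.
best = max(hosts, key=lambda name: host_score(hosts[name]))
print(best)  # -> "unit-a"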
7. Extensible software and algorithms
The HCE-DC service for the Linux OS platform has three main
parts:
● The core service daemon modules, with functionality covering: the scheduler of crawling tasks, the manager of task queues, the managers of periodical processes, the manager of computational-unit data, the manager of storage resource aging, the manager of real-time API requests and so on. Typically the core daemon process runs on a dedicated host and represents the service itself.
● The computational unit modules, including the crawler-task crawling module, the processor-task scraping-algorithms module with several scraper modules, the db-task storage-management module, the pre-processor and finalizer modules, and several additional helper utilities. They act on session-based principles and exit after the input batch data set is processed (see the sketch below).
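The session-based principle can be illustrated as follows: a manager process launches a computational module as a separate OS process for a single batch, and the module exits as soon as the batch is processed. This is a sketch only; the script name and batch layout are hypothetical, not the actual HCE-DC code.

# Sketch: run one session-based module per batch (names are hypothetical).
import json
import subprocess

batch = {"items": [{"url": "http://example.com/"}]}  # hypothetical batch layout

# "crawler_task.py" stands in for one of the computational unit modules.
proc = subprocess.run(
    ["python3", "crawler_task.py"],
    input=json.dumps(batch),   # batch is passed on stdin
    capture_output=True,
    text=True,
    check=True,
)
result = json.loads(proc.stdout)  # the module process has already exited here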
8. Open processing architecture
The computational modules set can be extended with any kind of algorithms, libraries and frameworks, for any platform and programming language. The only limitation is the API interaction translation, which typically needs some adapters. The key principles are:
● Data processing modules are invoked as native OS processes or via an API, including REST and CLI.
● Process instances are isolated.
● The POSIX CLI API is the default for inter-process data exchange, or it is simulated by converter utilities.
● An open input/output protocol is used to process a batch sequentially, step by step, by each link of the processing chain (see the sketch after this list).
● Streaming open data formats such as JSON and XML can be easily serialized.
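The sketch below shows how one link of such a processing chain could look under the POSIX CLI convention: it reads a JSON batch from stdin, appends its own result and writes the batch to stdout, so steps can be composed with shell pipes. The script names and batch fields are illustrative assumptions, not the actual HCE-DC protocol.

# scrape_step.py: sketch of one chain link using stdin/stdout JSON exchange.
# Example composition (hypothetical script names):
#   cat batch.json | python3 fetch_step.py | python3 scrape_step.py > result.json
import json
import sys

def main() -> None:
    batch = json.load(sys.stdin)
    for item in batch.get("items", []):
        # Each step attaches its output without discarding previous steps' data.
        item.setdefault("steps", []).append("scrape_step")
    json.dump(batch, sys.stdout)

if __name__ == "__main__":
    main()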
12. Brief list of main DC features
Fully automated distributed web site crawling with: a set of root URLs, periodic re-crawling, HTTP and HTML redirects, HTTP timeouts, dynamic HTML rendering, prioritization, limits (size, pages, contents, errors, URLs, redirects, content types), request delaying, robots.txt, rotating proxies, RSS (1.0, 2.0, RDF, Atom), scan depth, filters, page chains and batching.
Fully automated distributed data processing: the News ™ article engine (pre-defined sequential scraping using extractors such as Goose, Newspaper and Scrapy) and the Template ™ universal engine (definitions of tags and rules to extract data from pages based on xpath and csspath, content part joining, merging, best-result selection, regular-expression post-processing, multi-item pages (products, search results, etc.), multi-rule and multi-template scraping), a WYSIWYG template editor, processed content selection and merging, and an extensible processing modules architecture (an xpath-rule sketch follows this paragraph).
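To give a flavour of such xpath-based rules, the sketch below applies two hypothetical rules to a tiny HTML fragment using the lxml library. The rule layout is an assumption made for illustration, not the Template engine's actual definition format.

# Sketch of xpath extraction rules applied to a page (rule layout is hypothetical).
from lxml import html

rules = {
    "title": "//h1/text()",
    "price": "//span[@class='price']/text()",  # e.g. a field on a multi-item product page
}

page = html.fromstring(
    "<html><body><h1>Sample product</h1>"
    "<span class='price'>19.99</span></body></html>"
)

extracted = {name: page.xpath(xpath) for name, xpath in rules.items()}
print(extracted)  # {'title': ['Sample product'], 'price': ['19.99']}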
Fully automated resource data management: periodic operations, data aging, update, re-crawling and re-processing.
Web administration console: full CRUD of projects for data collection and processing with a set of parameters per project, users with roles and permissions (ACL), DC and DTM service statistics, and per-project crawling and processing statistics.
Web REST API gateway: synchronous HTTP REST requests with batching for crawling and processing, a full-featured parameter set with additional limitations per user account and authorization state.
Real-time requests API: native CLI client, asynchronous and synchronous REST requests.
13. Statistics of a three-physical-host installation for one month
● Projects: 8
● Pages crawled: 6.2M
● Crawling batches: 60K
● Processing batches: 90K
● Purging batches: 16K
● Aging batches: 16K
● Project re-crawlings: 30K
● CPU load average: 0.45 avg / 3.5 max
● CPU utilization: 3% avg / 30% max
● I/O wait time: 0.31 avg / 6.4 max
● Network connections: 250 avg / 747 max
● Network traffic: 152 Kbps avg / 5.5 Mbps max
● Data hosts: 2
● Load balancing of OS system resources kept CPU load average, I/O wait and RAM usage linear, without spikes or overloads.
● Linear scalability of real-time requests per physical host.
● Linear scalability of automated crawling, processing and aging per physical host.