Distributed Crawler Service architecture presentation

Gennady Baranov
The Distributed Crawler
v2.5.x-chaika
Service architecture overview
The Hierarchical Cluster Engine project
IOIX Ukraine, 2014-2015
Introduction
The HCE-DC is a multipurpose, high-productivity, scalable and
extensible engine for web data collection and processing.
It is built on several products and technologies of the HCE project:
● The Distributed Tasks Manager (DTM) service.
● The hce-node network cluster application.
● The API bindings for the Python and PHP languages.
● The tools and library of crawling algorithms.
● The tools and library of scraping algorithms.
● The web administration console.
● The real-time REST API.
It provides flexible configuration and deployment automation to bring
an installation close to the target project and to ease integration.
The main functional purposes
● Crawling: scan web sites, analyze and parse pages, detect and
collect URL links and web resource data. Download resources from
web servers using URLs that were collected or provided with the
request, and store them in local raw data file storage. All of this
runs on a multi-host, multi-process architecture.
● Process web page contents with several customizable applied
algorithms, such as scraping of unstructured textual content, and
store the results in local SQL DB storage on a multi-host,
multi-process architecture.
● Manage crawling and processing tasks with scheduling and
balancing, using either the tasks management service of the
multi-host architecture or the real-time multi-threaded load-balancing
client-server architecture with a multi-host backend engine.
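The "detect and collect URL links" step of crawling can be illustrated with a minimal sketch. It uses only the Python standard library and is not the HCE-DC crawler code; the names LinkCollector and collect_links and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated link targets from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # keeps discovery order
        self._seen = set()   # used only for de-duplication

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop the #fragment.
        url, _fragment = urldefrag(urljoin(self.base_url, href))
        # Keep only web links (skips mailto:, javascript: and similar schemes).
        if url.startswith(("http://", "https://")) and url not in self._seen:
            self._seen.add(url)
            self.links.append(url)


def collect_links(base_url, html):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


page = (
    '<a href="/a">A</a> <a href="b.html#top">B</a> '
    '<a href="/a">again</a> <a href="mailto:x@example.com">mail</a>'
)
links = collect_links("http://example.com/dir/index.html", page)
print(links)
```

A real crawler would also normalize URLs further and apply per-project filters before queueing the collected links.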
The extended functionality
● Developer's API – full access to configuration, deployment,
monitoring and management of processes and data.
● Applied API – a full-featured, multi-threaded, multi-host, HTTP-based
REST protocol for performing crawling and scraping batch requests.
● Web administration console – management of the DC and DTM services,
user accounts, roles and permissions, crawling, scraping, collection,
aggregation and conversion of results, collection and visualization of
statistical data, notifications, triggers and other utility tools.
● Helper tools and libraries – several supporting applied utilities.
Distributed asynchronous nature
The HCE-DC engine is architecturally a fully distributed system. It
can be deployed and configured as a single-host or multi-host installation.
Key features and properties of the distributed architecture:
● No central database or data storage for crawling and
processing. Each physical host unit holds a shard of the data in the
same structures, and all units together are represented as a single service.
● Crawling and processing run in parallel on several physical hosts
in a multi-process way, including downloading, fetching, DOM
parsing, URL collection, field extraction, post-processing,
metric calculation and similar tasks.
● Customizable strategies for data sharding and request
balancing that minimize data redundancy and optimize
resource usage.
● Internal reduction of pages and scraped contents with smart
merging, which avoids duplicate resources in the client response
to a data fetch.
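One way to picture the sharding and merging points above is the sketch below. It assumes a simple hash-based sharding strategy and a "newest copy wins" merge rule. Both are illustrative choices, not the strategies HCE-DC actually ships with, and the names shard_for and merge_responses as well as the url and crawled fields are hypothetical.

```python
import hashlib


def shard_for(url, hosts):
    """Pick a host for a URL by hashing it (one possible sharding strategy)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


def merge_responses(per_host_results):
    """Merge per-host result lists into one response without duplicate URLs."""
    merged = {}
    for results in per_host_results:
        for item in results:
            # Assumed rule: keep the most recently crawled copy of a resource.
            current = merged.get(item["url"])
            if current is None or item["crawled"] > current["crawled"]:
                merged[item["url"]] = item
    return sorted(merged.values(), key=lambda item: item["url"])


hosts = ["node-1", "node-2"]
owner = shard_for("http://example.com/a", hosts)

merged = merge_responses([
    [{"url": "http://example.com/a", "crawled": 10, "title": "old"}],
    [{"url": "http://example.com/a", "crawled": 20, "title": "new"},
     {"url": "http://example.com/b", "crawled": 15, "title": "b"}],
])
titles = [item["title"] for item in merged]
print(owner, titles)
```

The same URL always maps to the same host, so no central index is needed to find where a resource is stored.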
Flexible balancing and scalability
The HCE-DC service can be deployed on a set of physical hosts.
The number of hosts depends on their hardware performance
(number of CPU cores, disk space, network interface speed and so on) and
can range from one to tens or more. The key scalability principles are:
– The hardware computational unit is a physical or logical
host (any kind of virtualization and containers is supported).
– Hardware units can be added to the system at run-time and are
gradually filled with data during regular crawling iterations.
No dedicated data migration is needed.
– Balancing of computational tasks is optimized for resource usage.
The tasks scheduler selects the computational unit
with the most free system resources using a customizable
estimation formula. Several system resource usage
indicators are available: CPU, RAM, disk, I/O wait,
number of processes, number of threads and so on.
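The selection of the least loaded unit can be sketched as follows. The estimation formula here is an assumed weighted sum of free resource fractions. The real formula is customizable and is not given in the slides, and the names free_score and select_unit, the weights and the usage figures are all hypothetical.

```python
def free_score(unit, weights):
    """Weighted sum of free resource fractions; higher means less loaded."""
    return sum(weights[name] * (1.0 - unit["usage"][name]) for name in weights)


def select_unit(units, weights):
    """Return the computational unit with the highest free-resource score."""
    return max(units, key=lambda unit: free_score(unit, weights))


# Usage values are fractions in [0, 1] reported by each host (made-up data).
units = [
    {"host": "node-1", "usage": {"cpu": 0.80, "ram": 0.50, "disk": 0.30, "io_wait": 0.10}},
    {"host": "node-2", "usage": {"cpu": 0.20, "ram": 0.60, "disk": 0.40, "io_wait": 0.05}},
    {"host": "node-3", "usage": {"cpu": 0.50, "ram": 0.90, "disk": 0.90, "io_wait": 0.40}},
]
weights = {"cpu": 0.4, "ram": 0.3, "disk": 0.1, "io_wait": 0.2}

chosen = select_unit(units, weights)
print(chosen["host"])
```

Changing the weights changes the choice: with a RAM-only formula, for example, node-1 would win because it has the most free memory.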
Extensible software and algorithms
The HCE-DC service for the Linux OS platform has three main
parts:
● The core service daemon modules, which provide: a scheduler of
crawling tasks, a manager of task queues, managers of periodic
processes, a manager of computational unit data, a manager of storage
resource aging, a manager of real-time API requests and so on.
Typically the core daemon process runs on a dedicated
host and represents the service itself.
● The set of computational unit modules, including the crawling
module crawler-task, the scraping algorithms module processor-task with
several scraper modules, the storage management module db-task,
pre-processor and finalizer modules and several additional
helper utilities. They act on session-based principles
and exit after the input batch data set has been processed.
Open processing architecture
The set of computational modules can be extended with any kind of
algorithms, libraries and frameworks for any platform and
programming language. The only limitation is the translation of the API
interaction, which typically requires some adapters. The key principles
are:
● Data processing modules are invoked as native OS
processes or via an API, including REST and CLI.
● Process instances are isolated.
● The POSIX CLI API is the default for inter-process data exchange;
otherwise it is simulated by converter utilities.
● An open input/output protocol is used to process a batch
sequentially, step by step, through each processing chain.
● Streamed data uses open formats that are easy to serialize:
JSON, XML and so on.
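The principles above can be sketched as a chain of isolated OS processes that pass a JSON batch through standard input and output. This is a minimal illustration, not the HCE-DC protocol: the batch layout (an items list with text and words fields) and the two stand-in stages are assumptions made for the example.

```python
import json
import subprocess
import sys

# Two stand-in processing modules. Each one runs as a separate OS process,
# reads a JSON batch from stdin, changes it and writes it to stdout.
STRIP_STAGE = (
    "import json, sys\n"
    "batch = json.load(sys.stdin)\n"
    "for item in batch['items']:\n"
    "    item['text'] = item['text'].strip()\n"
    "json.dump(batch, sys.stdout)\n"
)
COUNT_STAGE = (
    "import json, sys\n"
    "batch = json.load(sys.stdin)\n"
    "for item in batch['items']:\n"
    "    item['words'] = len(item['text'].split())\n"
    "json.dump(batch, sys.stdout)\n"
)


def run_chain(batch, stages):
    """Pass the batch through each stage in order, one process per stage."""
    data = json.dumps(batch)
    for script in stages:
        done = subprocess.run(
            [sys.executable, "-c", script],
            input=data, capture_output=True, text=True, check=True,
        )
        data = done.stdout  # output of one stage is the input of the next
    return json.loads(data)


result = run_chain(
    {"items": [{"text": "  hello crawler world \n"}]},
    [STRIP_STAGE, COUNT_STAGE],
)
print(result)
```

Because each stage only needs stdin and stdout, a stage written in another language can replace a Python one without changes to the chain runner.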
General DC service architecture (diagram slide)
Real-time client-server architecture (diagram slide)
Internal DC service architecture (diagram slide)
Brief list of main DC features
● Fully automated distributed crawling of web sites with: a set of root URLs, periodic
re-crawling, HTTP and HTML redirects, HTTP timeouts, dynamic HTML rendering, prioritization,
limits (size, pages, contents, errors, URLs, redirects, content types), request delays,
robots.txt, rotating proxies, RSS (1, 2, RDF, Atom), scan depth, filters, page chains and
batching.
● Fully automated distributed data processing: the News™ article scraping engine (a pre-defined
sequence of scraping steps and extractors: Goose, Newspaper and Scrapy) and the Template™
universal scraping engine (definitions of tags and rules to extract data from pages based on
XPath and CSS paths, joining and merging of content parts, best result selection, regular
expression post-processing, multi-item pages (products, search results, etc.), multi-rule,
multi-template), a WYSIWYG template editor, selection and merging of processed contents,
and an extensible architecture of processing modules.
● Fully automated resource data management: periodic operations, data aging, update,
re-crawling and re-processing.
● Web administration console: full CRUD of projects for data collection and processing with a
set of parameters per project, users with roles and permissions (ACL), statistics of the DC
and DTM services, and crawling and processing statistics per project.
● Web REST API gateway: synchronous HTTP REST requests with batching for crawling and
processing, a full-featured parameter set with additional limitations per user account and
authorization state.
● Real-time requests API: native CLI client, asynchronous and synchronous REST requests.
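Three of the crawling controls listed above (robots.txt, scan depth and a page limit) can be combined as in the sketch below. It crawls a small in-memory stand-in for a web site rather than the network, and the crawl function, the SITE data and the demo-bot agent name are hypothetical. Only urllib.robotparser from the standard library is used for the robots.txt rules.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# In-memory stand-in for a web site: URL -> outgoing links (made-up data).
SITE = {
    "http://example.com/": ["http://example.com/news", "http://example.com/private/x"],
    "http://example.com/news": ["http://example.com/news/1", "http://example.com/news/2"],
    "http://example.com/news/1": ["http://example.com/news/1/comments"],
    "http://example.com/news/2": [],
    "http://example.com/private/x": [],
    "http://example.com/news/1/comments": [],
}

# Parse robots.txt rules from lines instead of fetching them over HTTP.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def crawl(root, max_depth, max_pages, agent="demo-bot"):
    """Breadth-first crawl limited by robots.txt, scan depth and page count."""
    visited = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # disallowed by robots.txt
        visited.append(url)
        if depth == max_depth:
            continue  # scan depth reached, do not follow links further
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("http://example.com/", max_depth=2, max_pages=10)
print(pages)
```

With max_depth=2 the comments page at depth 3 is never queued, and the /private/ page is skipped because of the Disallow rule.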
Statistics of a three-physical-host installation for one month
● Projects: 8
● Pages crawled: 6.2M
● Crawling batches: 60K
● Processing batches: 90K
● Purging batches: 16K
● Aging batches: 16K
● Project re-crawlings: 30K
● CPU load average: 0.45 avg / 3.5 max
● CPU utilization: 3% avg / 30% max
● I/O wait time: 0.31 avg / 6.4 max
● Network connections: 250 avg / 747 max
● Network traffic: 152 Kbps avg / 5.5 Mbps max
● Data hosts: 2
● Load balancing of OS resources: linearly managed CPU load average,
I/O wait and RAM usage, without excesses or overloads.
● Linear scalability of real-time requests per physical host.
● Linear scalability of automated crawling, processing and aging
per physical host.
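For a rough sense of scale, the monthly totals above can be turned into average rates. The sketch below assumes a 30-day month and an even split of pages across the two data hosts. Neither assumption comes from the slides, so the results are only approximate averages.

```python
# Figures taken from the statistics above.
pages_crawled = 6_200_000
crawling_batches = 60_000
data_hosts = 2

# Assumption: "one month" is treated as 30 days.
days = 30
seconds = days * 24 * 60 * 60

pages_per_day = pages_crawled / days
pages_per_second = pages_crawled / seconds
pages_per_batch = pages_crawled / crawling_batches
# Assumption: pages are spread evenly over the data hosts.
pages_per_second_per_host = pages_per_second / data_hosts

print(f"pages per day:             {pages_per_day:,.0f}")
print(f"pages per second:          {pages_per_second:.2f}")
print(f"pages per crawling batch:  {pages_per_batch:.1f}")
print(f"pages per second per host: {pages_per_second_per_host:.2f}")
```

Under these assumptions the installation averages a little over two pages per second and roughly a hundred pages per crawling batch.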