The Distributed Crawler
v2.5.x-chaika
Service architecture overview
The Hierarchical Cluster Engine project
IOIX Ukraine, 2014-2015
Introduction
The HCE-DC is a multipurpose, high-productivity, scalable and
extensible engine for collecting and processing web data.
It is built on several products and technologies of the HCE project:
● The Distributed Tasks Manager (DTM) service.
● The hce-node network cluster application.
● The API bindings for the Python and PHP languages.
● The tools and library of crawling algorithms.
● The tools and library of scraping algorithms.
● The web administration console.
● The real-time REST API.
It provides flexible configuration and deployment automation, so an
installation can be fitted closely to a target project and integrated easily.
The main functional purposes
● Crawling - scan web sites, analyze and parse pages, detect and
collect URLs and web resource data. Download resources from web
servers using URLs that were collected or provided with the request,
and store them in local raw data file storage. All of this runs on a
multi-host, multi-process architecture.
● Process web page contents with several customizable applied
algorithms, such as scraping of unstructured textual content, and
store the results in local SQL database storage on a multi-host,
multi-process architecture.
● Manage crawling and processing tasks with scheduling and
balancing, using either the tasks management service of the
multi-host architecture or the real-time multi-threaded load-balancing
client-server architecture with a multi-host backend engine.
The extended functionality
● Developer's API - full access to configuration, deployment,
monitoring and management of processes and data.
● Applied API - a full-featured multi-threaded, multi-host REST
HTTP-based protocol for performing crawling and scraping batch requests.
● Web administration console - management of the DC and DTM services,
user accounts, roles, permissions, crawling, scraping, collection,
aggregation and conversion of results, collection and visualization
of statistical data, notifications, triggers and other utility tools.
● Helper tools and libraries - several supporting applied utilities.
Distributed asynchronous nature
The HCE-DC engine is architecturally a fully distributed system. It
can be deployed and configured as a single-host or multi-host installation.
Key features and properties of the distributed architecture:
● No central database or data storage for crawling and
processing. Each physical host unit has the same structures and
holds a shard of the data, while the hosts together are represented
as a single service.
● Crawling and processing run in parallel on several physical hosts
in a multi-process way, including downloading, fetching, DOM
parsing, URL collection, field extraction, post-processing,
metric calculations and similar tasks.
● Customizable strategies for data sharding and request
balancing, minimizing data redundancy and optimizing
resource usage.
● Internal reduction of pages and scraped contents with smart
merging, which avoids duplicate resources in the response to a
client's data fetch request.
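The sharding and merging ideas above can be illustrated with a minimal Python sketch. It is not the HCE-DC implementation: the hash-based placement rule, the function names and the result layout (dictionaries with a "url" key) are all assumptions made for illustration, and the deck only says that the real strategies are customizable.

```python
import hashlib


def shard_for(url: str, hosts: list[str]) -> str:
    """Pick a data host for a URL by hashing it (hypothetical strategy)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


def merge_responses(per_host_results: list[list[dict]]) -> list[dict]:
    """Reduce per-host result lists into one list without duplicates.

    The first copy seen for each URL is kept; later copies are dropped.
    """
    merged: dict[str, dict] = {}
    for results in per_host_results:
        for item in results:
            merged.setdefault(item["url"], item)
    return list(merged.values())
```

With such a rule the same URL always maps to the same host, so no central index is needed to find it, and the reduce step is what keeps a client response free of duplicate resources when several hosts return the same page.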
Flexible balancing and scalability
The HCE-DC service can be deployed on a set of physical hosts.
The number of hosts depends on their hardware performance
(number of CPU cores, disk space, network interface speed and so on) and
can range from one to tens or more. The key scalability principles are:
– The hardware computational unit is a physical or logical
host (any kind of virtualization and containers is supported).
– Hardware units can be added to the system at run-time and are
gradually filled with data during regular crawling iterations.
No dedicated data migration is needed.
– Balancing of computational tasks is optimized for resource usage.
The task scheduler selects the computational unit
with the most free system resources using a customizable
estimation formula. Several resource usage
indicators are available: CPU, RAM, disk, I/O wait,
number of processes, number of threads and so on.
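The selection rule in the last principle can be sketched as a weighted score over usage indicators. The deck does not give the actual estimation formula, so the linear formula, the weights and the names UnitStats, free_score and select_unit below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnitStats:
    """Usage indicators of one computational unit, each in the range 0..1."""
    name: str
    cpu: float
    ram: float
    disk: float
    io_wait: float


# Hypothetical default weights; a real deployment would tune these.
DEFAULT_WEIGHTS = {"cpu": 0.4, "ram": 0.3, "disk": 0.1, "io_wait": 0.2}


def free_score(unit: UnitStats, weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the free share (1 - usage) of each indicator."""
    return sum(w * (1.0 - getattr(unit, key)) for key, w in weights.items())


def select_unit(units: list[UnitStats],
                weights: dict[str, float] = DEFAULT_WEIGHTS) -> UnitStats:
    """Return the unit with the most free resources under the given weights."""
    return max(units, key=lambda u: free_score(u, weights))
```

Passing a different weights dictionary changes which unit wins, which is one simple way to make the estimation customizable, for example by weighting only disk usage for storage-heavy tasks.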
Extensible software and algorithms
The HCE-DC service for the Linux OS platform has three main
parts:
● The core service daemon modules, which provide: the scheduler of
crawling tasks, the manager of task queues, managers of periodic
processes, the manager of computational unit data, the manager of
storage resource aging, the manager of real-time API requests and
so on. Typically the core daemon process runs on a dedicated
host and represents the service itself.
● The computational unit module set, including the crawling
crawler-task, the scraping algorithms processor-task and
several scraper modules, the storage management db-task,
pre-processor and finalizer modules, and several additional
helper utilities. They act on session-based principles
and exit after the input batch data set is processed.
Open processing architecture
The computational module set can be extended with any kind of
algorithms, libraries and frameworks for any platform and
programming language. The only limitation is the translation of API
interaction, which typically requires some adapters. The key principles
are:
● Data processing modules are invoked as native OS
processes or via an API, including REST and CLI.
● Process instances are isolated.
● The POSIX CLI API is the default for inter-process data exchange,
or is simulated by converter utilities.
● An open input/output protocol is used to process a batch
sequentially, step by step, through each processing chain.
● Open streaming data formats that can be easily serialized:
JSON, XML and so on.
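A processing module that follows these principles can be sketched as a filter that reads a serialized batch, runs it through a chain of steps and writes the result back. The batch layout (an "items" list with a "content" field) and the two steps are assumptions for illustration only; the deck does not specify the actual HCE-DC batch protocol.

```python
import json
import sys
from typing import Callable

Step = Callable[[dict], dict]


def strip_whitespace(batch: dict) -> dict:
    """Example step: collapse runs of whitespace in each item's content."""
    for item in batch["items"]:
        item["content"] = " ".join(item["content"].split())
    return batch


def add_length(batch: dict) -> dict:
    """Example step: attach the content length to each item."""
    for item in batch["items"]:
        item["length"] = len(item["content"])
    return batch


def run_chain(raw: str, steps: list[Step]) -> str:
    """Deserialize a JSON batch, apply the steps in order, serialize it again."""
    batch = json.loads(raw)
    for step in steps:
        batch = step(batch)
    return json.dumps(batch)


def main() -> None:
    """CLI entry point: batch in on stdin, processed batch out on stdout."""
    sys.stdout.write(run_chain(sys.stdin.read(), [strip_whitespace, add_length]))
```

Because the module only talks through stdin and stdout, a script that calls main() can be chained with modules written in any other language using ordinary shell pipes, and each instance runs as its own isolated OS process.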
General DC service architecture
Real-time client-server architecture
Internal DC service architecture
Brief list of main DC features
Fully automated distributed web site crawling with: a set of root URLs, periodic re-crawling,
HTTP and HTML redirects, HTTP timeouts, dynamic HTML rendering, prioritization, limits
(size, pages, contents, errors, URLs, redirects, content types), request delays,
robots.txt, rotating proxies, RSS (1, 2, RDF, Atom), scan depth, filters, page chains and
batching.
Fully automated distributed data processing: the News™ article scraping engine (pre-defined
sequential use of scraping tools and extractors: Goose, Newspaper and Scrapy) and the Template™
universal scraping engine (definitions of tags and rules to extract data from pages based on
XPath and CSS paths, joining and merging of content parts, best result selection, regular
expression post-processing, multi-item pages (products, search results, etc.), multi-rule,
multi-template), a WYSIWYG template editor, selection and merging of processed contents,
and an extensible processing module architecture.
Fully automated resource data management: periodic operations, data aging, update,
re-crawling and re-processing.
Web administration console: full CRUD of data collection and processing projects with a set of
parameters per project, users with roles and permissions (ACL), statistics of the DC and DTM
services, and crawling and processing statistics per project.
Web REST API gateway: synchronous HTTP REST requests with batching for crawling and
processing, a full-featured parameter set with additional limits per user account and
authorization state.
Real-time requests API: native CLI client, asynchronous and synchronous REST requests.
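Two of the crawling features listed above, robots.txt handling and request delays, can be illustrated with Python's standard urllib.robotparser module. This is a generic sketch, not the HCE-DC crawler code: the function names, the user agent string and the one-second default delay are assumptions.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser: RobotFileParser, user_agent: str, urls: list[str],
               default_delay: float = 1.0) -> list[tuple[str, float]]:
    """Return (url, delay in seconds) pairs for the URLs the agent may fetch.

    The Crawl-delay value from robots.txt is used when present; otherwise
    the hypothetical default_delay applies.
    """
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    return [(url, float(delay)) for url in urls
            if parser.can_fetch(user_agent, url)]
```

A crawler loop would then sleep for the returned delay before each request to the same host, for example with time.sleep.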
Statistics of a three-host physical
installation for one month
● Projects: 8
● Pages crawled: 6.2M
● Crawling batches: 60K
● Processing batches: 90K
● Purging batches: 16K
● Aging batches: 16K
● Project re-crawlings: 30K
● CPU load average: 0.45 avg / 3.5 max
● CPU utilization: 3% avg / 30% max
● I/O wait time: 0.31 avg / 6.4 max
● Network connections: 250 avg / 747 max
● Network traffic: 152 Kbps avg / 5.5 Mbps max
● Data hosts: 2
● Load balancing of OS resources kept CPU load average, I/O wait
and RAM usage linearly managed, without excesses or overloads.
● Linear scalability of real-time requests per physical host.
● Linear scalability of automated crawling, processing and
aging per physical host.

Distributed Crawler Service architecture presentation

  • 1. The Distributed Crawler v2.5.x-chaika Service architecture overview The Hierarchical Cluster Engine project IOIX Ukraine, 2014-2015
  • 2. Introduction The HCE-DC is a multipurpose high productivity scalable and extensible engine of web data collecting and processing. Built on several HCE project's products and technologies: ● The Distributed Tasks Manager (DTM) service. ● The hce-node network cluster application. ● The API bindings for Python and PHP languages. ● The tools and library of crawling algorithms. ● The tools and library of scraping algorithms. ● Web administration console. ● The real-time REST API. provides flexible configuration and deployment automation to get installation closed to a target project and easy integration.
  • 3. ● Crawling - scan the web sites, analyze and parse pages, detect and collect URLs links and web resources' data. Download resources from web-servers using collected or provided with request URL(s) and store them in local raw data file storage. Everything on multi-host and multi-process architecture ● Process web page contents with several customizable applied algorithms like a unstructured textual content scraping and store results in local sql db storage on multi-host and multi-process architecture. ● Manage tasks of crawling and processing with scheduling and balancing using tasks management service of multi-host architecture or real-time multi-threaded load-balancing client-server architecture with multi-host backend engine. The main functional purposes
  • 4. The extended functionality ● Developer's API – full access for configuration, deployment, monitoring and management processes and data. ● Applied API – full featured multi-thread multi-host REST http-based protocol to perform crawling and scraping batch requests. ● Web administration console - the DC and DTM services and user's accounts, roles, permissions, crawling, scraping, results collect, aggregation and convert management, statistical data collect and visualize, notification, triggering and another utility tools. ● Helper tools and libraries – several support applied utility.
  • 5. Distributed asynchronous nature The HCE-DC engine is architecturally a fully distributed system. It can be deployed and configured as a single-host or a multi-host installation. Key features and properties of the distributed architecture: ● No central database or data storage for crawling and processing. Each physical host holds a shard of the data in the same structures, while all hosts together are represented as a single service. ● Crawling and processing run in parallel on several physical hosts in a multi-process way, including tasks such as downloading, fetching, DOM parsing, URL collection, field extraction, post-processing and metric calculation. ● Customizable strategies for data sharding and request balancing that minimize data redundancy and optimize resource usage. ● Internal reduction of pages and scraped contents with smart merging, which avoids duplicate resources in the response returned to a client that fetches data.
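The sharding and reduce-with-merging ideas from the slide above can be sketched as follows. This is an illustration under stated assumptions, not the HCE-DC implementation: the slides say the sharding strategy is customizable but do not describe it, so hash-based sharding by URL is an assumption, and the names `shard_for_url` and `merge_responses`, the record fields `url` and `crawled_at`, and the "newest copy wins" merge policy are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Sequence


def shard_for_url(url: str, hosts: Sequence[str]) -> str:
    """Pick the host that stores a URL (assumed strategy: hash of the URL)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


def merge_responses(per_host_results: Iterable[List[Dict]]) -> List[Dict]:
    """Reduce per-host result lists into one list without duplicate URLs.

    Assumed merge policy: when several hosts return the same URL,
    keep the most recently crawled copy.
    """
    merged: Dict[str, Dict] = {}
    for results in per_host_results:
        for item in results:
            current = merged.get(item["url"])
            if current is None or item["crawled_at"] > current["crawled_at"]:
                merged[item["url"]] = item
    return sorted(merged.values(), key=lambda item: item["url"])


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c"]
    print(shard_for_url("http://example.com/page", hosts))
    print(merge_responses([
        [{"url": "http://example.com/1", "crawled_at": 10}],
        [{"url": "http://example.com/1", "crawled_at": 20}],
    ]))
```

A plain modulo over the host list reshuffles most keys when a host is added. The next slide states that new hosts are filled gradually without dedicated migration, so the real strategy is likely different from this one.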
  • 6. Flexible balancing and scalability The HCE-DC service can be deployed on a set of physical hosts. The number of hosts depends on their hardware performance (number of CPU cores, disk space, network interface speed and so on) and can range from one to tens or more. Key scalability principles are: – The hardware computational unit is a physical or logical host (any kind of virtualization and containers are supported). – Hardware units can be added to the system at run-time and are gradually filled with data during regular crawling iterations. No dedicated data migration is needed. – Computational task balancing is optimized for resource usage. The task scheduler selects the computational unit with the most free system resources using a customizable estimation formula. Several system resource usage indicators are available: CPU, RAM, disk, I/O wait, number of processes, number of threads and so on.
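The scheduler rule on the slide above (select the unit with the most free resources according to a customizable estimation formula) can be sketched as a weighted score. The actual HCE-DC formula is not given in the slides, so the weighted sum below, the weights, and the names `UnitStats`, `free_score` and `select_unit` are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class UnitStats:
    """Resource usage of one computational unit, each value in 0.0..1.0."""
    name: str
    cpu: float
    ram: float
    disk: float
    io_wait: float


# Hypothetical default weights; a deployment would tune these.
DEFAULT_WEIGHTS: Dict[str, float] = {"cpu": 0.4, "ram": 0.3, "disk": 0.1, "io_wait": 0.2}


def free_score(unit: UnitStats, weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Assumed estimation formula: weighted sum of the free share of each resource."""
    return sum(weight * (1.0 - getattr(unit, indicator))
               for indicator, weight in weights.items())


def select_unit(units: Sequence[UnitStats],
                weights: Dict[str, float] = DEFAULT_WEIGHTS) -> UnitStats:
    """Return the unit with the highest free-resources score."""
    if not units:
        raise ValueError("no computational units available")
    return max(units, key=lambda unit: free_score(unit, weights))


if __name__ == "__main__":
    units = [
        UnitStats("node-1", cpu=0.90, ram=0.50, disk=0.50, io_wait=0.10),
        UnitStats("node-2", cpu=0.10, ram=0.50, disk=0.50, io_wait=0.10),
    ]
    print(select_unit(units).name)
```

Passing a different `weights` mapping changes the formula without touching the selection code, which is one simple way to keep the estimation customizable.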
  • 7. Extensible software and algorithms The HCE-DC service for the Linux OS platform has three main parts: ● The core service daemon modules, which provide: the crawling task scheduler, the task queue manager, managers of periodic processes, the manager of computational unit data, the manager of storage resource aging, the manager of real-time API requests and so on. Typically the core daemon process runs on a dedicated host and represents the service itself. ● The computational unit module set, which includes the crawling module crawler-task, the scraping algorithms module processor-task with several scraper modules, the storage management module db-task, the pre-processor and finalizer modules, and several additional helper utilities. These modules act on a session-based principle and exit after the input batch data set has been processed.
  • 8. Open processing architecture The computational module set can be extended with any kind of algorithms, libraries and frameworks for any platform and programming language. The only limitation is the translation of the API interaction, which typically requires some adapters. The key principles are: ● Data processing modules are invoked as native OS processes or via an API, including REST and CLI. ● Process instances are isolated. ● The POSIX CLI API is the default for inter-process data exchange; otherwise it is simulated by converter utilities. ● An open input/output protocol is used to process a batch sequentially, step by step, through each processing chain. ● Open streaming data formats that are easy to serialize are used: JSON, XML and so on.
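The principles on the slide above (isolated OS processes, CLI data exchange, a batch passed step by step through a chain, JSON serialization) can be sketched as follows. This is not the HCE-DC batch protocol: the batch layout (`{"items": [...]}`), the `run_chain` function and the two inline step scripts are hypothetical stand-ins, with the steps started through the current Python interpreter so that the example stays self-contained.

```python
import json
import subprocess
import sys

# Hypothetical processing steps. Each one is an isolated OS process that
# reads a JSON batch from stdin and writes the updated batch to stdout.
STRIP_STEP = """
import json, sys
batch = json.load(sys.stdin)
for item in batch["items"]:
    item["text"] = item["text"].strip()
json.dump(batch, sys.stdout)
"""

COUNT_STEP = """
import json, sys
batch = json.load(sys.stdin)
for item in batch["items"]:
    item["words"] = len(item["text"].split())
json.dump(batch, sys.stdout)
"""

CHAIN = [
    [sys.executable, "-c", STRIP_STEP],
    [sys.executable, "-c", COUNT_STEP],
]


def run_chain(batch, chain):
    """Pass a batch through each command in order, via stdin/stdout."""
    payload = json.dumps(batch)
    for command in chain:
        completed = subprocess.run(
            command, input=payload, capture_output=True, text=True, check=True
        )
        payload = completed.stdout  # output of one step is input of the next
    return json.loads(payload)


if __name__ == "__main__":
    print(run_chain({"items": [{"text": "  hello crawler world  "}]}, CHAIN))
```

Because each step only needs to read stdin and write stdout, a step written in another language can replace any entry in `CHAIN` without changes to `run_chain`.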
  • 9. General DC service architecture
  • 11. Internal DC service architecture
  • 12. Brief list of main DC features ● Fully automated distributed web site crawling with: a set of root URLs, periodic re-crawling, HTTP and HTML redirects, HTTP timeouts, dynamic HTML rendering, prioritization, limits (size, pages, contents, errors, URLs, redirects, content types), request delays, robots.txt, rotating proxies, RSS (1, 2, RDF, Atom), scan depth, filters, page chains and batching. ● Fully automated distributed data processing: the News™ article scraping engine (pre-defined sequential use of scrapers and extractors: Goose, Newspaper and Scrapy), the Template™ universal scraping engine (definitions of tags and rules that extract data from pages based on xpath and csspath, joining and merging of content parts, best result selection, regular expression post-processing, multi-item pages (products, search results, etc.), multi-rule, multi-template), a WYSIWYG template editor, selection and merging of processed contents, and an extensible processing module architecture. ● Fully automated resource data management: periodic operations, data aging, update, re-crawling and re-processing. ● Web administration console: full CRUD of projects for data collection and processing with a set of parameters per project, users with roles and permissions (ACL), statistics of the DC and DTM services, and crawling and processing statistics per project. ● Web REST API gateway: synchronous HTTP REST requests with batching for crawling and processing, and a full-featured parameter set with additional limits per user account and authorization state. ● Real-time requests API: a native CLI client, asynchronous and synchronous REST requests.
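The template-based scraping described on the slide above (xpath rules, best result selection, regular expression post-processing) can be illustrated with a small sketch. It does not use the HCE-DC template format: the `TEMPLATE` layout, the `extract` function and the "longest non-empty candidate wins" policy are assumptions. The sketch relies on the limited XPath subset of the standard `xml.etree.ElementTree` module and on a well-formed sample page; real-world HTML would need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<h1 class="title">  Sample product  </h1>
<div id="price"><span>Price: 19.99 USD</span></div>
</body></html>"""

# Hypothetical template: several xpath rules per field plus an optional regex.
TEMPLATE = {
    "title": {"xpaths": [".//h2[@class='title']", ".//h1[@class='title']"], "regex": None},
    "price": {"xpaths": [".//div[@id='price']/span"], "regex": r"(\d+\.\d+)"},
}


def extract(page, template):
    """Apply every rule of every field and keep one value per field."""
    root = ET.fromstring(page)
    result = {}
    for field, rule in template.items():
        candidates = []
        for xpath in rule["xpaths"]:
            for node in root.findall(xpath):
                text = "".join(node.itertext()).strip()
                if text:
                    candidates.append(text)
        # Assumed "best result" policy: the longest non-empty candidate.
        value = max(candidates, key=len) if candidates else None
        # Regular expression post-processing keeps only the first captured group.
        if value is not None and rule["regex"]:
            match = re.search(rule["regex"], value)
            value = match.group(1) if match else None
        result[field] = value
    return result


if __name__ == "__main__":
    print(extract(PAGE, TEMPLATE))
```

Listing several xpath rules per field lets one template cover pages with slightly different markup, which is the idea behind the multi-rule feature named on the slide.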
  • 13. Statistics for an installation of three physical hosts over one month ● Projects: 8 ● Pages crawled: 6.2M ● Crawling batches: 60K ● Processing batches: 90K ● Purging batches: 16K ● Aging batches: 16K ● Project re-crawlings: 30K ● CPU load average: 0.45 avg / 3.5 max ● CPU utilization: 3% avg / 30% max ● I/O wait time: 0.31 avg / 6.4 max ● Network connections: 250 avg / 747 max ● Network traffic: 152 Kbps avg / 5.5 Mbps max ● Data hosts: 2 ● Load balancing of OS system resources: CPU load average, I/O wait and RAM usage are managed linearly, without excesses or overloads. ● Linear scalability of real-time requests per physical host. ● Linear scalability of automated crawling, processing and aging per physical host.
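The monthly totals on the slide above can be turned into average rates with simple arithmetic. The input figures come from the slide; the 30-day month is an assumption, and the derived values are rough averages rather than measured throughput.

```python
# Figures taken from the slide.
PAGES_CRAWLED = 6_200_000
CRAWLING_BATCHES = 60_000
DATA_HOSTS = 2

# Assumption: "one month" is treated as 30 days.
DAYS_IN_MONTH = 30
SECONDS_PER_DAY = 86_400

pages_per_batch = PAGES_CRAWLED / CRAWLING_BATCHES
pages_per_day = PAGES_CRAWLED / DAYS_IN_MONTH
pages_per_second = pages_per_day / SECONDS_PER_DAY
pages_per_second_per_data_host = pages_per_second / DATA_HOSTS

print(f"pages per crawling batch:       {pages_per_batch:.1f}")
print(f"pages per day:                  {pages_per_day:.0f}")
print(f"pages per second:               {pages_per_second:.2f}")
print(f"pages per second per data host: {pages_per_second_per_data_host:.2f}")
```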