Distributed Crawler Service architecture presentation


Distributed Crawler Service architecture. Describes the main architecture, key principles and feature list of the Distributed Crawler software.

Published in: Software

  1. The Distributed Crawler v2.5.x-chaika Service: architecture overview. The Hierarchical Cluster Engine project. IOIX Ukraine, 2014-2015
  2. Introduction The HCE-DC is a multipurpose, high-productivity, scalable and extensible engine for web data collection and processing. It is built on several products and technologies of the HCE project: ● The Distributed Tasks Manager (DTM) service. ● The hce-node network cluster application. ● The API bindings for the Python and PHP languages. ● The tools and library of crawling algorithms. ● The tools and library of scraping algorithms. ● The web administration console. ● The real-time REST API. It provides flexible configuration and deployment automation, so that an installation can be tailored closely to a target project and integrated easily.
  3. The main functional purposes ● Crawling: scan web sites, analyze and parse pages, detect and collect URL links and web resource data. Download resources from web servers using URLs that were either collected or provided with the request, and store them in local raw-data file storage, all on a multi-host, multi-process architecture. ● Processing: process web page contents with several customizable applied algorithms, such as scraping of unstructured textual content, and store the results in local SQL DB storage, also on a multi-host, multi-process architecture. ● Management: manage crawling and processing tasks with scheduling and balancing, using either the tasks management service of the multi-host architecture or the real-time multi-threaded load-balancing client-server architecture with a multi-host backend engine.
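
The "detect and collect URL links" step on slide 3 can be illustrated with a minimal sketch. This is not the HCE-DC crawler code; it only shows the general idea using the Python standard library, and the names `LinkCollector` and `collect_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated HTTP(S) targets of <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # links in order of first appearance
        self._seen = set()   # used to skip duplicates

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self._seen:
            self._seen.add(absolute)
            self.links.append(absolute)


def collect_links(base_url, html):
    """Return the list of unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would add filtering, depth tracking and storage on top of this; the sketch covers only link extraction from one downloaded page.
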
  4. The extended functionality ● Developer's API: full access to configuration, deployment, monitoring and management of processes and data. ● Applied API: a full-featured multi-threaded, multi-host, HTTP-based REST protocol for performing crawling and scraping batch requests. ● Web administration console: management of the DC and DTM services, user accounts, roles and permissions, crawling, scraping, and the collection, aggregation and conversion of results; collection and visualization of statistical data; notifications, triggering and other utility tools. ● Helper tools and libraries: several supporting applied utilities.
  5. Distributed asynchronous nature The HCE-DC engine is architecturally a fully distributed system. It can be deployed and configured as a single-host or a multi-host installation. Key features and properties of the distributed architecture: ● No central database or data storage for crawling and processing. Each physical host unit has the same structures and holds a shard of the data, but together the hosts are presented as a single service. ● Crawling and processing run in parallel on several physical hosts in a multi-process way, including downloading, fetching, DOM parsing, URL collecting, field extraction, post-processing, metric calculation and similar tasks. ● Customizable strategies for data sharding and request balancing, which minimize data redundancy and optimize resource usage. ● Pages and scraped contents are reduced internally with smart merging, which avoids duplicate resources in the response returned to a client that fetches data.
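
The sharding and merging ideas on slide 5 can be sketched as follows. The slides do not describe the actual strategies, so everything here is an assumption made for illustration: sharding by a hash of the site name, merging by "first record per URL", and the hypothetical names `shard_for` and `merge_responses`.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for(url, hosts):
    """Pick a host for a URL by hashing its site name (assumed strategy).

    All URLs of one site map to the same host, and the choice is stable
    across runs because it does not depend on Python's randomized hash().
    """
    site = urlsplit(url).netloc.lower()
    digest = hashlib.md5(site.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]


def merge_responses(per_host_results):
    """Merge result lists from several hosts, keeping one record per URL.

    Assumption: the first record seen for a URL wins; a real merge could
    instead prefer the freshest or the most complete record.
    """
    merged = {}
    for results in per_host_results:
        for record in results:
            merged.setdefault(record["url"], record)
    return list(merged.values())
```

Plain modulo hashing remaps sites whenever the host list changes, so it only illustrates the principle; it is not a claim about how HCE-DC assigns data to hosts.
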
  6. Flexible balancing and scalability The HCE-DC service can be deployed on a set of physical hosts. The number of hosts depends on their hardware performance (number of CPU cores, disk space, network interface speed and so on) and can range from one to tens or more. The key scalability principles are: – The hardware computational unit is a physical or logical host (any kind of virtualization and containers are supported). – Hardware units can be added to the system at run-time and are gradually filled with data during regular crawling iterations. No dedicated data migration is needed. – Balancing of computational tasks is optimized for resource usage. The task scheduler selects the computational unit with the most free system resources, using a customizable estimation formula. Several system resource usage indicators are available: CPU, RAM, disk, I/O wait, number of processes, number of threads and so on.
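
The scheduler rule on slide 6 (select the unit with the most free resources according to an estimation formula) can be sketched like this. The slides do not give the formula, so the weighted sum below, the indicator names and the functions `free_score` and `pick_unit` are assumptions for illustration only.

```python
def free_score(usage, weights):
    """Estimate free capacity as a weighted sum of unused resource fractions.

    `usage` maps an indicator name to its used fraction in [0, 1];
    `weights` maps the same names to their relative importance.
    """
    return sum(weight * (1.0 - usage[name]) for name, weight in weights.items())


def pick_unit(units, weights):
    """Return the name of the computational unit with the highest free score."""
    return max(units, key=lambda name: free_score(units[name], weights))


# Hypothetical snapshot of two hosts and a hypothetical weighting.
units = {
    "host-a": {"cpu": 0.9, "ram": 0.5, "io_wait": 0.1},
    "host-b": {"cpu": 0.2, "ram": 0.6, "io_wait": 0.1},
}
weights = {"cpu": 0.5, "ram": 0.3, "io_wait": 0.2}
print(pick_unit(units, weights))  # prints "host-b"
```

Changing the weights is the simplest way to "customize" such a formula, for example to favor hosts with low I/O wait for disk-heavy batches.
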
  7. Extensible software and algorithms The HCE-DC service for the Linux OS platform has three main parts: ● The core service daemon modules, which provide: a scheduler of crawling tasks, a manager of task queues, managers of periodic processes, a manager of computational unit data, a manager of storage resource aging, a manager of real-time API requests and so on. Typically the core daemon process runs on a dedicated host and represents the service itself. ● The computational unit module set, which includes the crawling module crawler-task, the scraping algorithms module processor-task with several scraper modules, the storage management module db-task, the pre-processor and finalizer modules, and several additional helper utilities. These modules act on a session-based principle and exit after the input batch data set has been processed.
  8. Open processing architecture The set of computational modules can be extended with any kind of algorithms, libraries and frameworks, for any platform and programming language. The only limitation is the translation of the API interaction, which typically needs some adapters. The key principles are: ● Data processing modules are invoked as native OS processes or via an API, including REST and CLI. ● Process instances are isolated. ● The POSIX CLI API is the default for inter-process data exchange; where it is not available, it is simulated by converter utilities. ● An open input/output protocol is used to process a batch sequentially, step by step, through each processing chain. ● Streamed data uses open formats that are easy to serialize: JSON, XML and so on.
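
The principles on slide 8 (isolated OS processes, CLI data exchange, a batch passed step by step through a chain, JSON as the format) can be sketched as below. The real HCE-DC input/output protocol is not described in the slides; the batch layout, the function `run_chain` and the two stand-in modules are hypothetical.

```python
import json
import subprocess
import sys


def run_chain(batch, commands):
    """Pass a JSON batch through each command in order via stdin/stdout.

    Every step is a separate OS process that reads the whole batch from
    stdin, writes the transformed batch to stdout and then exits.
    """
    data = json.dumps(batch)
    for argv in commands:
        completed = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = completed.stdout
    return json.loads(data)


# Stand-in modules: tiny inline Python programs, not real HCE-DC tasks.
UPPERCASE_TITLES = [
    sys.executable, "-c",
    "import json, sys; b = json.load(sys.stdin); "
    "[i.update(title=i['title'].upper()) for i in b['items']]; "
    "json.dump(b, sys.stdout)",
]
COUNT_ITEMS = [
    sys.executable, "-c",
    "import json, sys; b = json.load(sys.stdin); "
    "b['count'] = len(b['items']); json.dump(b, sys.stdout)",
]

result = run_chain(
    {"items": [{"title": "first"}, {"title": "second"}]},
    [UPPERCASE_TITLES, COUNT_ITEMS],
)
```

Because each step only needs to read and write JSON on standard streams, a module written in any language can take the place of either stand-in command.
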
  9. General DC service architecture
  10. Real-time client-server architecture
  11. Internal DC service architecture
  12. Brief list of main DC features ● Fully automated distributed web site crawling with: a set of root URLs, periodic re-crawling, HTTP and HTML redirects, HTTP timeout, dynamic HTML rendering, prioritization, limits (size, pages, contents, errors, URLs, redirects, content types), request delaying, robots.txt, rotating proxies, RSS (1, 2, RDF, Atom), scan depth, filters, page chains and batching. ● Fully automated distributed data processing: the News™ article scraping engine (pre-defined sequential use of scrapers and extractors: Goose, Newspaper and Scrapy) and the Template™ universal scraping engine (definitions of tags and rules to extract data from pages based on xpath and csspath, joining and merging of content parts, best result selection, regular expression post-processing, multi-item pages (products, search results, etc.), multi-rule, multi-template), a WYSIWYG template editor, selection and merging of processed contents, and an extensible processing modules architecture. ● Fully automated resource data management: periodic operations, data aging, update, re-crawling and re-processing. ● Web administration console: full CRUD of projects for data collection and processing with a set of parameters per project, users with roles and permissions (ACL), statistics of the DC and DTM services, and per-project crawling and processing statistics. ● Web REST API gateway: synchronous HTTP REST requests with batching for crawling and processing, a full-featured parameter set with additional limitations per user account and authorization state. ● Real-time requests API: native CLI client, asynchronous and synchronous REST requests.
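
Two of the crawling controls listed on slide 12, robots.txt and scan depth, can be combined into a single admission check. The sketch below uses Python's standard `urllib.robotparser`; the function `make_policy`, the user agent string and the depth convention (root URL at depth 0) are assumptions, not HCE-DC settings.

```python
from urllib.robotparser import RobotFileParser


def make_policy(robots_lines, user_agent, max_depth):
    """Build a check that applies robots.txt rules and a scan depth limit.

    `robots_lines` is the robots.txt content already split into lines, so
    the sketch needs no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url, depth):
        # A URL is crawled only if it is within the depth limit
        # and robots.txt permits it for this user agent.
        return depth <= max_depth and parser.can_fetch(user_agent, url)

    return allowed


allowed = make_policy(
    ["User-agent: *", "Disallow: /private/"],
    user_agent="example-crawler",  # hypothetical user agent
    max_depth=2,
)
```

The other limits on the slide (pages, errors, redirects, content types) could be added to `allowed` in the same way, as further conditions on the URL or on per-project counters.
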
  13. Statistics of an installation on three physical hosts for one month ● Projects: 8 ● Pages crawled: 6.2M ● Crawling batches: 60K ● Processing batches: 90K ● Purging batches: 16K ● Aging batches: 16K ● Project re-crawls: 30K ● CPU load average: 0.45 avg / 3.5 max ● CPU utilization: 3% avg / 30% max ● I/O wait time: 0.31 avg / 6.4 max ● Network connections: 250 avg / 747 max ● Network traffic: 152Kbps avg / 5.5Mbps max ● Data hosts: 2 ● Load balancing of OS system resources: CPU load average, I/O wait and RAM usage were managed linearly, without excesses or overloads. ● Linear scalability of real-time requests per physical host. ● Linear scalability of automated crawling, processing and aging per physical host.
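
The monthly totals on slide 13 can be turned into average rates with simple arithmetic. The calculation below assumes a 30-day month and an even split of pages between the two data hosts; neither assumption is stated in the slides.

```python
PAGES = 6_200_000        # pages crawled in one month (slide 13)
CRAWL_BATCHES = 60_000   # crawling batches in one month (slide 13)
DATA_HOSTS = 2           # data hosts (slide 13)
DAYS = 30                # assumption: a 30-day month
SECONDS_PER_MONTH = DAYS * 24 * 3600

pages_per_second = PAGES / SECONDS_PER_MONTH
pages_per_batch = PAGES / CRAWL_BATCHES
pages_per_host_per_day = PAGES / DATA_HOSTS / DAYS  # assumes an even split

print(round(pages_per_second, 2))     # prints 2.39
print(round(pages_per_batch, 1))      # prints 103.3
print(round(pages_per_host_per_day))  # prints 103333
```

These are averages only; the maxima on the slide (for example 5.5Mbps of network traffic) show that the actual load was bursty.
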