Crawling and Processing the Italian Corporate Web

SpazioDati collects public information about all Italian companies from many different sources, the most challenging being the World Wide Web. Our Internet Data Gathering project crawls and processes data from the entire Italian web, using distributed frameworks such as Hadoop, Nutch, Elasticsearch and Spark. ✨ This talk will give an overview of the extraction pipeline and present some of the issues we tackled during and after development.

Crawling and Processing the Italian Corporate Web

  1. Crawling and Processing the Italian Corporate Web. Alessio Guerrieri, SpazioDati S.R.L.
  2. Your speaker ● Born in Trento ● Studied at UniTN and Georgia Tech ● PhD in Large Scale Graph Analytics ● Teaches Algorithms and Data Structures ● Data Scientist at SpazioDati. In my spare time: ● {Read|Watch|Play} {Science Fiction|Fantasy} {Novels|TV|Board Games}
  3. SpazioDati S.R.L. ● Born in 2012 ● Data integration ● Focus on the corporate world: ○ Official data from the Camera di Commercio ○ Open data ● Atoka: ○ B2B database of company information ○ Sales intelligence ○ API ● Data analytics: ○ Portfolio analysis ○ Lead generation ○ Risk evaluation. Always hard at work!
  4. Internet Data Gathering (IDG). IDG is an internal project to gather, process and organize internet data about Italian companies. It uses many different technologies for Big Data gathering and processing. The entire pipeline runs on Amazon AWS. A representation of the Internet
  5. Internet Data Gathering (IDG). Takeaways: ● Web data is HORRIBLE ● OSS can help! ● For Big Data, you need a Big Framework
  6. Crawling the Corporate Web
  7. Web Crawler. Image from https://en.wikipedia.org/wiki/Web_crawler
  8. Apache Nutch ● Distributed crawler runnable on Hadoop ● Highly configurable. Each iteration: 1. The Injector adds new URLs 2. The Generator runs the Scoring Function to select URLs 3. The URLs are divided into segments 4. Each segment is downloaded in parallel 5. The pages are parsed 6. Newly discovered URLs are added to the CrawlDB. A conceptual sketch of one iteration follows.
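To make the iteration concrete, here is a minimal Python sketch of the loop above; crawl_db, fetcher and parser are hypothetical stand-ins for Nutch's CrawlDB, fetcher and parser components, not Nutch's actual code.

    # Conceptual sketch of one Nutch-style iteration (not Nutch's real code).
    # crawl_db maps each URL to its status and score; fetcher and parser are
    # hypothetical callables standing in for Nutch's components.
    from urllib.parse import urljoin

    def crawl_iteration(crawl_db, fetcher, parser, top_n=1000):
        # Generator: select the highest-scoring unfetched URLs
        candidates = [u for u, rec in crawl_db.items() if rec["status"] == "unfetched"]
        candidates.sort(key=lambda u: crawl_db[u]["score"], reverse=True)
        for url in candidates[:top_n]:          # one "segment"; Nutch fetches these in parallel
            html = fetcher(url)                 # download the page
            crawl_db[url]["status"] = "fetched"
            for link in parser(html):           # parse the page and extract outlinks
                absolute = urljoin(url, link)
                # newly discovered URLs enter the CrawlDB for future iterations
                crawl_db.setdefault(absolute, {"status": "unfetched", "score": 1.0})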
  10. Nutch in SpazioDati ● Restricted to: ○ .it domains ○ domains registered in Italy (found through whois) ● Runs weekly: ○ Cluster of 15 machines ○ Uses the Elastic MapReduce service ○ 12M pages each week ● Keeps the complete history: ○ 5.3 TB downloaded ○ Pages older than 4 months are no longer processed
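The geographic restriction boils down to a URL filter; a minimal sketch, assuming a hypothetical allowlist of non-.it domains whose whois record says they are registered in Italy (in Nutch this logic would live in a URL filter plugin):

    from urllib.parse import urlparse

    # Hypothetical allowlist of non-.it domains with an Italian whois record
    ITALIAN_WHOIS_DOMAINS = {"esempio-registrato-in-italia.com"}

    def should_crawl(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return host.endswith(".it") or host in ITALIAN_WHOIS_DOMAINS

    print(should_crawl("https://www.esempio.it/pagina"))   # True
    print(should_crawl("https://www.example.com/"))        # False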
  11. Crawling is not easy! Issues with crawling: ● People who do not want to be crawled ○ Be polite! ○ We follow the robots.txt specification and use a unique User-Agent ● Avoid accidental DDoS attacks ○ Each domain should be crawled sequentially ● Never crawl too deeply ○ Filters on depth, URL length and queries ○ Try to avoid crawling a single domain too much. “The crawlers delved too greedily and too deep” https://www.amazon.it/s/ref=lp_1345828031_nr_p_n_binding_browse-b_0?fst=as%3Aoff&rh=n%3A411663031%2Cn%3A%21411664031%2Cn%3A1345828031%2Cp_n_binding_browse-bin%3A509801031&bbn=1345828031&ie=UTF8&qid=1504078452&rnid=509800031
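The robots.txt part of politeness can be checked with the Python standard library alone; a small sketch (the User-Agent string and URLs are illustrative, not the ones used in production):

    import urllib.robotparser

    USER_AGENT = "ExampleCorporateCrawler/1.0"  # illustrative UA, not the real one

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.it/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    # Fetch a page only if the site's robots.txt allows our User-Agent to do so
    if rp.can_fetch(USER_AGENT, "https://www.example.it/catalogo/pagina-1"):
        print("allowed to fetch")
    else:
        print("robots.txt forbids this path")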
  12. Processing the Corporate Web
  13. Extracting data from the crawl. The crawler gives us compressed JSON of HTML with metadata ● Structured, useful information ● Domain-based ● Distributed processing. Extracted signals by difficulty: ● Easy information: Text, Links, Codici Fiscali ● Medium information: Social Accounts, Logo, Language ● Complex information: Technologies, Entities, People. A sketch of one easy extractor follows.
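To make the "easy" column concrete, here is a sketch of pulling codici fiscali out of page text with regular expressions. The patterns are simplified assumptions (no checksum validation): companies use an 11-digit numeric code (also the partita IVA), individuals a 16-character alphanumeric one.

    import re

    # Simplified patterns for Italian fiscal codes (no checksum validation)
    CF_PERSON = re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")  # 16-char personal code
    CF_COMPANY = re.compile(r"\b\d{11}\b")                                 # 11-digit company code

    def extract_codici_fiscali(text: str):
        return CF_PERSON.findall(text) + CF_COMPANY.findall(text)

    print(extract_codici_fiscali("P.IVA 01234567890, CF RSSMRA80A01H501U"))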
  14. Hadoop for data processing: ● The user writes User-Defined Functions (UDFs) ● The Hadoop framework: ○ Stores the input data ○ Divides it into chunks ○ Makes it available to all machines ○ Runs the UDFs on all chunks ○ Guarantees fault tolerance ○ Collects the output. This guy does not have the energy to implement fault tolerance...
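The division of labour is easiest to see with Hadoop Streaming, where a UDF is just a script reading records on stdin; a word-count mapper sketch (Hadoop handles chunking, distribution, fault tolerance and shuffling around it):

    import sys

    # Hadoop Streaming mapper: Hadoop pipes one chunk of input lines to stdin
    # and collects the emitted (word, 1) pairs; a separate reducer script then
    # receives each word's pairs grouped together and sums the counts.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")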
  15. Pig. Scripting language for Hadoop ● Scripts are written in Pig Latin ● Looks kinda like SQL ● Pipelines are easy to build:

    input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    filtered_words = FILTER words BY word MATCHES '\w+';
    word_groups = GROUP filtered_words BY word;
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  16. Pig in SpazioDati. Our pipeline: 1. Computes the domain for each page 2. Groups pages by domain 3. Extracts information for each domain 4. Integrates data from other sources (e.g. whois) 5. Exports a JSON for each domain ● Runs (roughly) monthly ● Cluster of 30 machines ● AWS’s Elastic MapReduce service ● Difficult to test :(
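A minimal local sketch of steps 1, 2 and 5 of this pipeline (the helper name and example data are hypothetical; the real job is a Pig script running on the cluster):

    import json
    from collections import defaultdict
    from urllib.parse import urlparse

    def group_by_domain(pages):
        """pages: iterable of (url, html) pairs -> dict mapping domain to its pages."""
        by_domain = defaultdict(list)
        for url, html in pages:
            by_domain[urlparse(url).hostname].append(html)
        return by_domain

    # One JSON document per domain, as in step 5
    pages = [("http://www.esempio.it/chi-siamo", "<html>...</html>")]
    for domain, html_pages in group_by_domain(pages).items():
        print(json.dumps({"domain": domain, "n_pages": len(html_pages)}))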
  17. Querying the Corporate Web
  18. Requirements. We want to index our extracted data: ● We should be able to access it easily ● We should be able to explore it efficiently. We will be able to: ● Match it with official data about companies ● Serve it in the backend of our services. 5M JSONs without indexing
  19. Elasticsearch. Open source search engine ● Based on a Lucene index ○ Highly efficient ○ Mostly on disk ● Full-text search ● Nested fields support ● Cluster structure ● Web interface ● Allows (very) complex queries. 5M indexed JSONs
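For orientation, indexing and querying such documents with the official Python client might look like this sketch; the local cluster, the index name "domains" and the example document are assumptions, while the field names follow the queries on the next slides.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Index one extracted domain document (hypothetical content)
    es.index(index="domains", id="www.esempio.it", document={
        "text": "produttore di speck dal 1950",
        "technologies": {"cms": {"name": "WordPress", "version": "3.0"}},
    })

    # Full-text term query, equivalent to the JSON queries shown on the next slides
    result = es.search(index="domains", query={"term": {"text": "speck"}})
    print(result["hits"]["total"])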
  20. Sample query. Domains that contain the word ‘speck’ in the text:

    {
      "_source": false,
      "query": { "term": { "text": "speck" } },
      "size": 5
    }

    Response:

    {
      "hits": {
        "total": 15069,
        "max_score": 11.716405,
        "hits": [
          { "_id": "www.titospeck.it", "_score": 11.716405 },
          { "_id": "derpsairer.it", "_score": 11.6602 },
          { "_id": "www.speck.it", "_score": 11.626965 },
          { "_id": "www.bayona-music.com", "_score": 11.607182 },
          { "_id": "www.salumificiocoati.it", "_score": 11.560882 }
        ]
      }
    }
  21. Sample query (2). Domains whose text contains phrases similar to ‘speck and tech’:

    {
      "_source": false,
      "query": { "term": { "text": "speck and tech" } },
      "size": 3
    }

    Response:

    {
      "hits": {
        "total": 1003897,
        "max_score": 19.871191,
        "hits": [
          { "_id": "speckand.tech", "_score": 19.871191 },
          { "_id": "www.speckietechies.com", "_score": 19.674822 },
          { "_id": "francescobonadiman.com", "_score": 17.935522 }
        ]
      }
    }
  22. Complex query. Count the domains running WordPress version 3.0:

    {
      "size": 0,
      "query": {
        "bool": {
          "must": [
            { "term": { "technologies.cms.name": "WordPress" } },
            { "term": { "technologies.cms.version": "3.0" } }
          ]
        }
      }
    }

    Response:

    {
      "took": 1,
      "timed_out": false,
      "_shards": { "total": 10, "successful": 10, "failed": 0 },
      "hits": { "total": 211, "max_score": 0, "hits": [] }
    }
  23. Complex query. Compute the distribution of the most used CMS software:

    {
      "size": 0,
      "aggregations": {
        "aggs": {
          "terms": { "field": "technologies.cms.name", "size": 20 }
        }
      }
    }

    Response:

    {
      "aggregations": {
        "aggs": {
          "doc_count_error_upper_bound": 997,
          "sum_other_doc_count": 43403,
          "buckets": [
            { "key": "WordPress", "doc_count": 590133 },
            { "key": "Joomla", "doc_count": 163595 },
            { "key": "Drupal", "doc_count": 33727 },
            { "key": "DM Polopoly", "doc_count": 30455 },
            { "key": "Weebly", "doc_count": 9861 }
          ]
        }
      }
    }
  24. Getting value from the Corporate Web
  25. The rest of the IDG pipeline. IDG is much more: ● Finding the correct domains for each company ● Extracting information from social networks ● Validating emails collected on the web (a sketch follows) ● etc. The real IDG pipeline
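Email validation, for instance, can combine a syntactic check with an MX lookup; a sketch assuming the third-party dnspython package (the talk does not show IDG's actual validation logic):

    import re
    import dns.exception
    import dns.resolver  # third-party: dnspython

    EMAIL_RE = re.compile(r"^[\w.+-]+@([\w-]+\.)+[\w-]{2,}$")

    def looks_deliverable(email: str) -> bool:
        # Syntactic check first, then verify the domain publishes MX records
        if not EMAIL_RE.match(email):
            return False
        domain = email.rsplit("@", 1)[1]
        try:
            return len(dns.resolver.resolve(domain, "MX")) > 0
        except dns.exception.DNSException:
            return False

    print(looks_deliverable("info@esempio.it"))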
  26. Conclusions ● There is a lot of Open Source Software for Big Data processing ● You’ll need to tinker with the available features ● Web data is often: ○ Outdated ○ Badly formatted ○ Ambiguous
  27. Thanks for your attention! Questions? Interested? See www.spaziodati.eu/jobs for opportunities!
