MEDICAL INFORMATICS
UNIVERSITY OF HEIDELBERG
HEILBRONN UNIVERSITY
Harvesting Online Health Information
Focused Web-Crawling – The Technical Perspective
Richard Zowalla
Outline of this talk…
 Background & Motivation
 Recap / Methods:
 (Focused) Web-Crawling
 Basic Crawler Architecture
 Focused Web-Crawling @ HHN
 Introduction to Apache Storm
 Introduction to StormCrawler & StormCrawler@HHN
 Technical Details / Hardware
 Conclusion & Live-Demo
2
How big is the Web?
 2015: the (indexed) Web was estimated to contain about 47 billion web pages [5],
and its structure has been analyzed in various studies…
 The overall structure of the Web has been described as a bow-tie (since 2000) [6,7]
[5] Van den Bosch, A., Bogers, T., & De Kunder, M. (2016). Estimating search engine index size variability: a 9-year longitudinal study. Scientometrics, 1-18.
[6] Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J. Graph structure in the web. Computer Networks. 2000;33(1):309–320. doi:
10.1016/S1389-1286(00)00083-9
[7] Hirate Y, Kato S, Yamana H. Web structure in 2005. In: Aiello W, Broder A, Janssen J, Milios E, editors. Algorithms and models for the web-graph, lecture notes in
computer science. San Diego: Springer; 2008. pp. 36–46.
3
Why analyze the „Health Web“?
 We want to know:
 Who is important in the Health Web?
 Who provides health-related information? How credible is the source?
 Which topics are relevant in the Health Web?
 Other possible use-cases:
 build a domain-specific search-engine :)
 Computational linguistic analysis (e.g. word frequencies, readability, …)
 Text-corpora for other „data mining“ approaches
4
Methods: Web-Crawling
5
Methods: Random Health Surfer (1)
6
[Figure: the random surfer starts from an information need]
Methods: Random Health Surfer (2)
7
Methods: Random Health Surfer (3)
8
Methods: Focused Web-Crawling
[8] Menczer F. ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery; 1997.
[9] Chakrabarti S, van den Berg M, Dom B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Comput Netw. 1999;31(11-16):1623–1640.
9
Methods: How to obtain (good) seeds?
 DNS Zone Files
 Web-Directories, e.g. DMOZ, Curlie
 pre-categorized by humans, RDF available
 poorly maintained, a lot of dead links
 Common Crawl Project
 a lot of raw data for different domains available
 Search-engine-based seed generation (see the sketch below)
 build queries related to your domain of interest
 use the top results as seeds…
 commercial influence…
 Reuse data from previous crawls…
10
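To make the search-engine-based option a bit more concrete, here is a minimal Java sketch (not taken from SHC): it combines health terms with qualifiers into queries, asks a search API for the top results, and reduces them to host-level seed URLs. fetchTopResults() is a hypothetical placeholder for whichever search API is actually available; only the query construction and de-duplication logic is shown.

import java.net.URI;
import java.util.*;

// Sketch: search-engine based seed generation (illustrative, not the SHC code).
public class SeedGenerator {

    public static void main(String[] args) {
        List<String> topics = Arrays.asList("diabetes", "impfung", "bluthochdruck");
        List<String> qualifiers = Arrays.asList("symptome", "behandlung", "ursachen");

        Set<String> seeds = new LinkedHashSet<>();           // drops duplicates, keeps order
        for (String topic : topics) {
            for (String qualifier : qualifiers) {
                String query = topic + " " + qualifier;      // e.g. "diabetes symptome"
                for (String url : fetchTopResults(query, 10)) {
                    seeds.add(toHostSeed(url));              // keep one seed per host
                }
            }
        }
        seeds.forEach(System.out::println);
    }

    // Hypothetical helper: wrap whatever search API is available and
    // return the URLs of the top-k results for the given query.
    static List<String> fetchTopResults(String query, int k) {
        return Collections.emptyList();
    }

    // Reduce a result URL to its host root, e.g. https://www.rki.de/foo -> https://www.rki.de/
    static String toHostSeed(String url) {
        return URI.create(url).resolve("/").toString();
    }
}

Keeping only one seed per host avoids a single, heavily ranked portal dominating the seed list.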
Methods: Basic Crawler Architecture
11
 implementation issues & pitfalls
 politeness (robots.txt, per-host crawl delays; see the sketch after this list)
 wrong / broken HTML markup
 dynamic content (JavaScript)
 spider traps
 technical issues
 DNS (UDP vs. TCP) / DNS Caching
 Disk I/O throughput
 Network Bandwidth
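To illustrate the politeness pitfall from the list above, here is a minimal sketch (an assumption, not the SHC fetcher) that enforces a fixed per-host delay between requests; a real crawler would additionally honour robots.txt, including any Crawl-delay directive.

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Sketch: per-host politeness delay (illustrative only).
public class PolitenessGate {

    private final long minDelayMs;
    private final Map<String, Long> lastFetch = new HashMap<>();

    public PolitenessGate(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Blocks until the host of the given URL may be fetched again, then records the fetch time.
    // A single lock keeps the sketch short; a production fetcher would queue per host instead.
    public synchronized void await(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        long last = lastFetch.getOrDefault(host, 0L);
        long wait = minDelayMs - (System.currentTimeMillis() - last);
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastFetch.put(host, System.currentTimeMillis());
    }
}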
Let‘s scale…
If you have enough time, everything can be done on a single machine,
but we don't want to wait for weeks :)
12
CPU
13
Memory
14
Bandwidth
15
Storage
16
Examples at Heilbronn University (1)
Semantic Health Crawler (SHC)
17
SHC - What is it?
 distributed, focused crawler for analyzing the German Health Web [18]
 based on StormCrawler
 combines different relevance computation methods to estimate crawl priority (see the sketch below)
 SVM-based text-classifier to compute topic relevance
 hosted and run by Heilbronn University
 Info-Page can be found here:
https://shc-info.gecko.hs-heilbronn.de/
18
[18] Zowalla R, Wiesner M, Pfeifer D. Analyzing the German Health Web using a Focused Crawling Approach. HEC 2016; Health - Exploring
Complexity: An Interdisciplinary Systems Approach, 28 August - 2 September 2016; Munich. Eur J Epidemiol (2016) 31:S1–S239. DOI: 10.1007/s10654-016-0183-1
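How the different relevance signals could be combined into a crawl priority is sketched below; the weights, the clamping and the idea of mixing a classifier score with a link-based score are illustrative assumptions, not the actual SHC formula.

// Sketch: combining relevance signals into a crawl priority (hypothetical, not the SHC formula).
public class PriorityEstimator {

    // Weighting between topical relevance (text classifier) and link context.
    private static final double W_TOPIC = 0.7;
    private static final double W_LINK  = 0.3;

    // topicScore: probability from a text classifier (e.g. an SVM) that the page is health-related.
    // linkScore:  relevance inherited from the pages linking to this URL.
    public static double priority(double topicScore, double linkScore) {
        double p = W_TOPIC * topicScore + W_LINK * linkScore;
        return Math.max(0.0, Math.min(1.0, p)); // clamp to [0,1]
    }

    public static void main(String[] args) {
        // A clearly health-related page discovered from a moderately relevant page:
        System.out.println(priority(0.9, 0.5)); // approx. 0.78
    }
}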
The Battle of the Crawlers
19
Source: http://digitalpebble.blogspot.com/2017/01/the-battle-of-crawlers-apache-nutch-vs.html
 Batch-Processing (Apache Nutch) versus Stream-Processing (StormCrawler)
Nutch fetched 6,038 pages per minute; StormCrawler fetched 9,792.26 pages per minute.
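Expressed per second, that is roughly 6,038 / 60 ≈ 101 pages/s for Nutch versus 9,792.26 / 60 ≈ 163 pages/s for StormCrawler, i.e. about 1.6 times the throughput in this particular benchmark.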
What is Apache Storm? (1) - Terminology
 distributed (real-time) stream processing framework (see the sketch below)
20
Taken from https://jansipke.nl/storm-in-pictures/ (28.11.18)
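To make the terminology concrete (spouts emit tuples into streams, bolts consume and process them, a topology wires both together), here is a minimal Java sketch; the UrlSpout and LogBolt below are illustrative placeholders and not part of Storm or SHC.

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A spout is a source of tuples, e.g. URLs taken from a crawl frontier.
class UrlSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        // Illustrative only: a real spout would pull the next URL from a frontier/status index.
        collector.emit(new Values("https://www.example.org/"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url"));
    }
}

// A bolt consumes tuples, does some work and may emit new tuples downstream.
class LogBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        System.out.println("Got URL: " + tuple.getStringByField("url"));
        collector.ack(tuple); // acknowledge so Storm considers the tuple processed
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no downstream output in this sketch
    }
}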
What is Apache Storm? (2) - Architecture
 fault-tolerant & scalable
21
Taken from https://jansipke.nl/storm-in-pictures/ (28.11.18)
What is Apache Storm? (3) - Architecture
22
Taken from https://jansipke.nl/storm-in-pictures/ (28.11.18)
Apache Storm & Web-Crawler (simplified)
23
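A drastically simplified crawl topology could be wired as sketched below, reusing the illustrative UrlSpout and LogBolt from the previous sketch in place of real fetch/parse bolts. The parallelism hints and worker count are made-up numbers, and a real crawler (e.g. StormCrawler) would partition by host or domain rather than by URL to enforce politeness.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Sketch: wiring a (very) simplified crawl topology (illustrative only).
public class CrawlTopologySketch {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("urls", new UrlSpout(), 1);

        // Group by the "url" field so identical URLs always reach the same executor;
        // a production crawler would instead group by host/domain for politeness.
        builder.setBolt("fetch", new LogBolt(), 4)
               .fieldsGrouping("urls", new Fields("url"));

        Config conf = new Config();
        conf.setNumWorkers(2);   // scale out over worker processes
        conf.setDebug(false);

        // Local test run; on a cluster one would use StormSubmitter.submitTopology(...)
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("crawl-sketch", conf, builder.createTopology());
        Thread.sleep(10_000);    // let the sketch run for a few seconds
        cluster.shutdown();      // newer Storm versions also offer close()
    }
}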
SHC: High Level Architecture
24
SHC: Crawl Topology (simplified)
25
Example: The Hell of Whitespace...
26
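The slide presumably refers to whitespace artefacts in text extracted from HTML (non-breaking spaces, zero-width characters, runs of blanks); a minimal normalization sketch, not the SHC implementation, could look like this:

// Sketch: normalising whitespace in text extracted from HTML (illustrative only).
public class WhitespaceNormalizer {

    public static String normalize(String text) {
        return text
                .replace('\u00A0', ' ')                  // non-breaking space -> regular space
                .replaceAll("[\u200B\u200C\uFEFF]", "")  // zero-width characters -> removed
                .replaceAll("\\s+", " ")                 // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Impfen\u00A0\u00A0ist \u200Bwichtig \n\n "));
        // -> "Impfen ist wichtig"
    }
}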
SHC: Virtualization Layer
27
 Virtualization platform: VMware ESXi / vSphere
 ~ 45 virtual machines for different requirements
 Persistence Layer
    ElasticSearch Cluster
    Neo4J Enterprise Cluster
    PostgreSQL Cluster
 Streaming Layer (Apache Storm)
    Zookeeper / Nimbus Cluster
    Supervisors
 Administration Layer
    Notification & Alerting (Monitoring)
    Backup Services
SHC: System Performance
28
[Figure labels: www.rki.de, www.bmg.bund.de, www.impfen-info.de, www.kindergesundheit-info.de]
SHC: Graph Structures (d=3)
29
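To give an idea of how such a depth-limited link graph could be queried from the Neo4J cluster mentioned above, here is a hedged sketch using the Neo4j Java driver (4.x API); the bolt URI, the credentials, the node label Host and the relationship type LINKS_TO are assumptions for illustration, not the actual SHC schema.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

// Sketch: fetching all hosts reachable within 3 link hops of a seed host.
// Label "Host", relationship "LINKS_TO", URI and credentials are illustrative assumptions.
public class GraphNeighborhood {

    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            Result result = session.run(
                "MATCH (s:Host {name: $seed})-[:LINKS_TO*1..3]->(t:Host) " +
                "RETURN DISTINCT t.name AS host",
                Values.parameters("seed", "www.rki.de"));

            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("host").asString());
            }
        }
    }
}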
Take home message…
 Extracting information from unstructured (textual) data requires a
lot of computational effort
 processing & analysis software frameworks need to
 (a) handle such computational load
 (b) scale with the data (high throughput)
 Often, such scalable software has to be built from scratch because existing frameworks cannot cope with (a) or (b).
30
Live-Demo
31
 dev-shc-master (192.168.247.10)
    Zookeeper, Nimbus, Supervisor (http://192.168.247.10:8080/index.html)
    Neo4J Enterprise (http://192.168.247.10:8080/index.html)
    ElasticSearch
    Kibana (http://192.168.247.10:5601/app/kibana)
 dev-shc-worker-1 (192.168.247.11)
    Zookeeper, Nimbus, Supervisor
 dev-shc-worker-2 (192.168.247.12)
    Zookeeper, Nimbus, Supervisor
Contact
Richard Zowalla
Dept. of Medical Informatics
Heilbronn University
Max-Planck-Str. 39
D-74081 Heilbronn
Twitter: @zowalla
Mail: richard.zowalla@hs-heilbronn.de
Web: http://www.mi.hs-heilbronn.de/
32