MEDICAL INFORMATICS
UNIVERSITY OF HEIDELBERG
HEILBRONN UNIVERSITY
Harvesting Online Health Information
Focused Web-Crawling – The Technical Perspective
Richard Zowalla
How big is the Web?
2015: the indexed Web was estimated to contain 47 billion Web pages [5],
and its structure has been analyzed in various studies…
Since 2000, the overall structure of the Web has been described as a bow-tie [6,7]
[5] Van den Bosch A, Bogers T, De Kunder M. Estimating search engine index size variability: a 9-year longitudinal study. Scientometrics. 2016:1–18.
[6] Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J. Graph structure in the web. Computer Networks. 2000;33(1):309–320. doi:10.1016/S1389-1286(00)00083-9
[7] Hirate Y, Kato S, Yamana H. Web structure in 2005. In: Aiello W, Broder A, Janssen J, Milios E, editors. Algorithms and models for the web-graph, Lecture Notes in Computer Science. San Diego: Springer; 2008. pp. 36–46.
Why analyze the "Health Web"?
We want to know:
Who is important in the Health Web?
Who provides health-related information? How credible is the source?
Which topics are relevant in the Health Web?
Other possible use-cases:
build a domain-specific search-engine :)
computational linguistic analysis (e.g., word frequencies, readability, …)
text corpora for other "data mining" approaches
Methods: Focused Web-Crawling
[8] Menczer F. ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery; 1997.
[9] Chakrabarti S, van den Berg M, Dom B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Comput Netw. 1999;31(11–16):1623–1640.
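A focused crawler keeps its frontier as a priority queue ordered by estimated topic relevance, so promising links are fetched first [9]. A minimal sketch in Python, using a toy in-memory link graph and a keyword lookup as a stand-in for a real relevance classifier (all URLs, terms, and scores are illustrative):

```python
import heapq

# Toy link graph standing in for the Web (illustrative data only).
LINK_GRAPH = {
    "seed.example/health": ["a.example/diabetes", "b.example/sports"],
    "a.example/diabetes": ["a.example/insulin", "b.example/news"],
    "b.example/sports": ["b.example/football"],
    "a.example/insulin": [],
    "b.example/news": [],
    "b.example/football": [],
}

def relevance(url):
    """Stand-in for a topic classifier: score URLs mentioning health terms."""
    terms = ("health", "diabetes", "insulin")
    return 1.0 if any(t in url for t in terms) else 0.1

def focused_crawl(seeds, budget=4):
    """Fetch highest-priority URLs first; heapq is a min-heap, so scores
    are negated to pop the most relevant URL each round."""
    frontier = [(-relevance(u), u) for u in seeds]
    heapq.heapify(frontier)
    visited, fetched = set(), []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        fetched.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return fetched

print(focused_crawl(["seed.example/health"]))
```

With this toy graph, the health-related pages are visited before the sports and finance pages, even though the crawl budget is far smaller than the graph.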
Methods: How to obtain (good) seeds?
DNS Zone Files
Web directories, e.g., DMOZ, Curlie
pre-categorized by humans, RDF available
poorly maintained, a lot of dead links
Common Crawl Project
a lot of raw data for different domains available
Search-Engine based seed generation
build queries related to your domain of interest
use the top results as seeds…
beware of commercial influence on the result rankings…
Reuse data from previous crawls…
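For search-engine-based seed generation, a common pattern is to combine domain terms with generic information-seeking modifiers into queries and feed the top results into the frontier. A minimal sketch of the query-building step (the German term lists are hypothetical examples; any search API could consume the resulting queries):

```python
from itertools import product

def build_seed_queries(topics, modifiers):
    """Combine each topic term with each modifier into one search query."""
    return [f"{t} {m}" for t, m in product(topics, modifiers)]

# Hypothetical health topics and info-seeking modifiers (German examples).
queries = build_seed_queries(
    ["diabetes", "impfung"],
    ["informationen", "ratgeber"],
)
print(queries)
```

The cross product grows quickly, which is usually desirable here: more query variants reduce the bias any single result ranking introduces.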
Methods: Basic Crawler Architecture
implementation issues & pitfalls
politeness
wrong / broken HTML markup
dynamic content (JavaScript)
spider traps
technical issues
DNS (UDP vs. TCP) / DNS Caching
Disk I/O throughput
Network Bandwidth
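Politeness typically means enforcing a minimum delay between successive requests to the same host. A minimal sketch of such a per-host gate (the class name and delay value are illustrative; a production crawler would additionally honor robots.txt rules):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Track the last fetch time per host and report how long a
    caller must still wait before politely fetching a given URL."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_fetch = {}  # host -> timestamp of last allowed request

    def wait_time(self, url, now=None):
        """Return 0.0 if the URL may be fetched now (and record the fetch),
        otherwise the remaining seconds until the host is polite again."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(host)
        remaining = 0.0 if last is None else max(0.0, self.delay - (now - last))
        if remaining == 0.0:
            self.last_fetch[host] = now
        return remaining
```

Keeping the state per host (not per URL) is the important design choice: a distributed crawler can then shard the frontier by host so each worker enforces politeness locally.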
Let‘s scale…
If you have enough time,
everything can be done on a single machine
but …
we don‘t want to wait for weeks :)
SHC - What is it?
distributed, focused crawler for analyzing the German Health Web [18]
based on StormCrawler
combines different relevance computation methods to estimate priority
SVM-based text-classifier to compute topic relevance
hosted and run by Heilbronn University
Info-Page can be found here:
https://shc-info.gecko.hs-heilbronn.de/
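A trained linear SVM classifies a page by the sign of its decision function w·x + b. A minimal sketch of applying such a decision function over bag-of-words features, with hand-picked, purely illustrative token weights (the actual SHC classifier is trained on a labeled corpus with proper feature extraction):

```python
# Hypothetical learned weights: positive for health terms, negative for
# off-topic terms. Real weights come from SVM training, not by hand.
WEIGHTS = {"diabetes": 1.4, "insulin": 1.2, "impfung": 1.1,
           "fussball": -0.9, "aktien": -1.3}
BIAS = -0.2

def decision_value(text):
    """Sum the weights of known tokens, plus the bias term (w.x + b)."""
    return sum(WEIGHTS.get(t, 0.0) for t in text.lower().split()) + BIAS

def is_health_page(text):
    """Pages with a positive decision value count as topic-relevant."""
    return decision_value(text) > 0

print(is_health_page("Diabetes Insulin Therapie"))
```

Because the decision function is just a weighted sum, relevance scoring stays cheap enough to run on every fetched page inside the crawl loop, which is what makes it usable for frontier prioritization.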
[18] Zowalla R, Wiesner M, Pfeifer D. Analyzing the German Health Web using a Focused Crawling Approach. HEC 2016; Health - Exploring Complexity: An Interdisciplinary Systems Approach, 28 August - 2 September 2016; Munich. Eur J Epidemiol (2016) 31:S1–S239. doi:10.1007/s10654-016-0183-1
The Battle of the Crawlers
Source: http://digitalpebble.blogspot.com/2017/01/the-battle-of-crawlers-apache-nutch-vs.html
Batch-Processing (Apache Nutch) versus Stream-Processing (StormCrawler)
Nutch fetched 6,038 pages per minute; StormCrawler fetched 9,792.26 pages per minute (≈ 62% higher throughput).
What is Apache Storm? (1) - Terminology
a distributed (real-time) stream processing framework
Taken from https://jansipke.nl/storm-in-pictures/ (28.11.18)
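Storm's core abstractions are spouts (stream sources) and bolts (stream transformers), wired together into a topology. A conceptual sketch of that dataflow in plain Python (real Storm topologies are usually written in Java against the org.apache.storm API; all names and data here are illustrative):

```python
def url_spout():
    """A spout emits an unbounded stream of tuples; here, a finite sample."""
    for url in ["http://a.example/", "http://b.example/seite"]:
        yield {"url": url}

def fetch_bolt(stream):
    """A bolt consumes tuples and emits transformed tuples downstream;
    the actual HTTP fetch is simulated with a fixed status code."""
    for t in stream:
        yield {"url": t["url"], "status": 200}

def index_bolt(stream):
    """Terminal bolt: collect results (stands in for indexing/storage)."""
    return list(stream)

# Chaining spout -> bolt -> bolt mirrors the edges of a Storm topology.
results = index_bolt(fetch_bolt(url_spout()))
print(results)
```

In Storm proper, each spout and bolt runs as parallel tasks across a cluster and tuples flow continuously, which is what gives a crawler like StormCrawler its high, steady throughput.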
What is Apache Storm? (2) - Architecture
fault tolerant & scalable
What is Apache Storm? (3) - Architecture
Take-home message…
Extracting information from unstructured (textual) data requires a lot of computational effort.
Processing & analysis software frameworks need to
(a) handle such computational load
(b) scale with the data (high throughput)
Often, such scalable software must be built from scratch, because existing frameworks cannot deal with (a) or (b).