MEDICAL INFORMATICS
UNIVERSITY OF HEIDELBERG
HEILBRONN UNIVERSITY
Harvesting Online Health Information
Focused Web-Crawling – The Technical Perspective
Richard Zowalla
How big is the Web?
2015: the indexed Web was estimated to contain 47 billion Web pages [5],
and its structure has been analyzed in various studies…
Since 2000, the overall structure of the Web has been described as a bow-tie [6,7]
[5] Van den Bosch A, Bogers T, De Kunder M. Estimating search engine index size variability: a 9-year longitudinal study. Scientometrics. 2016:1–18.
[6] Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J. Graph structure in the web. Computer Networks. 2000;33(1):309–320. doi:10.1016/S1389-1286(00)00083-9
[7] Hirate Y, Kato S, Yamana H. Web structure in 2005. In: Aiello W, Broder A, Janssen J, Milios E, editors. Algorithms and models for the web-graph, Lecture Notes in Computer Science. San Diego: Springer; 2008. pp. 36–46.
Why analyze the "Health Web"?
We want to know:
Who is important in the Health Web?
Who provides health-related information? How credible is the source?
Which topics are relevant in the Health Web?
Other possible use-cases:
build a domain-specific search-engine :)
computational linguistic analysis (e.g., word frequencies, readability, …)
text corpora for other "data mining" approaches
Methods: Focused Web-Crawling
[8] Menczer F. ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery; 1997.
[9] Chakrabarti S, van den Berg M, Dom B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Comput Netw. 1999;31(11–16):1623–1640.
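A focused crawler keeps its frontier as a priority queue ordered by estimated topic relevance, so promising links are fetched first [9]. A minimal sketch in Python, using a toy in-memory link graph and a keyword lookup as a stand-in for a real relevance classifier (all URLs, terms, and scores are illustrative):

```python
import heapq

# Toy link graph standing in for the Web (illustrative data only).
LINK_GRAPH = {
    "seed.example/health": ["a.example/diabetes", "b.example/sports"],
    "a.example/diabetes": ["a.example/insulin", "b.example/news"],
    "b.example/sports": ["b.example/football"],
    "a.example/insulin": [],
    "b.example/news": [],
    "b.example/football": [],
}

def relevance(url):
    """Stand-in for a topic classifier: score URLs mentioning health terms."""
    terms = ("health", "diabetes", "insulin")
    return 1.0 if any(t in url for t in terms) else 0.1

def focused_crawl(seeds, budget=4):
    """Fetch highest-priority URLs first; heapq is a min-heap, so scores
    are negated to pop the most relevant URL each round."""
    frontier = [(-relevance(u), u) for u in seeds]
    heapq.heapify(frontier)
    visited, fetched = set(), []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        fetched.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return fetched

print(focused_crawl(["seed.example/health"]))
```

With this toy graph, the health-related pages are visited before the sports and finance pages, even though the crawl budget is far smaller than the graph.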
Methods: How to obtain (good) seeds?
DNS Zone Files
Web directories, e.g., DMOZ, Curlie
pre-categorized by humans, RDF available
poorly maintained, a lot of dead links
Common Crawl Project
a lot of raw data for different domains available
Search-Engine based seed generation
build queries related to your domain of interest
use the top results as seeds…
beware of commercial influence on the result rankings…
Reuse data from previous crawls…
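For search-engine-based seed generation, a common pattern is to combine domain terms with generic information-seeking modifiers into queries and feed the top results into the frontier. A minimal sketch of the query-building step (the German term lists are hypothetical examples; any search API could consume the resulting queries):

```python
from itertools import product

def build_seed_queries(topics, modifiers):
    """Combine each topic term with each modifier into one search query."""
    return [f"{t} {m}" for t, m in product(topics, modifiers)]

# Hypothetical health topics and info-seeking modifiers (German examples).
queries = build_seed_queries(
    ["diabetes", "impfung"],
    ["informationen", "ratgeber"],
)
print(queries)
```

The cross product grows quickly, which is usually desirable here: more query variants reduce the bias any single result ranking introduces.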
Methods: Basic Crawler Architecture
implementation issues & pitfalls
politeness
wrong / broken HTML markup
dynamic content (JavaScript)
spider traps
technical issues
DNS (UDP vs. TCP) / DNS Caching
Disk I/O throughput
Network Bandwidth
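Politeness typically means enforcing a minimum delay between successive requests to the same host. A minimal sketch of such a per-host gate (the class name and delay value are illustrative; a production crawler would additionally honor robots.txt rules):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Track the last fetch time per host and report how long a
    caller must still wait before politely fetching a given URL."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_fetch = {}  # host -> timestamp of last allowed request

    def wait_time(self, url, now=None):
        """Return 0.0 if the URL may be fetched now (and record the fetch),
        otherwise the remaining seconds until the host is polite again."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(host)
        remaining = 0.0 if last is None else max(0.0, self.delay - (now - last))
        if remaining == 0.0:
            self.last_fetch[host] = now
        return remaining
```

Keeping the state per host (not per URL) is the important design choice: a distributed crawler can then shard the frontier by host so each worker enforces politeness locally.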
Let‘s scale…
If you have enough time,
everything can be done on a single machine
but …
we don‘t want to wait for weeks :)
SHC - What is it?
distributed, focused crawler for analyzing the German Health Web [18]
based on StormCrawler
combines different relevance computation methods to estimate priority
SVM-based text-classifier to compute topic relevance
hosted and run by Heilbronn University
Info-Page can be found here:
https://shc-info.gecko.hs-heilbronn.de/
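A trained linear SVM classifies a page by the sign of its decision function w·x + b. A minimal sketch of applying such a decision function over bag-of-words features, with hand-picked, purely illustrative token weights (the actual SHC classifier is trained on a labeled corpus with proper feature extraction):

```python
# Hypothetical learned weights: positive for health terms, negative for
# off-topic terms. Real weights come from SVM training, not by hand.
WEIGHTS = {"diabetes": 1.4, "insulin": 1.2, "impfung": 1.1,
           "fussball": -0.9, "aktien": -1.3}
BIAS = -0.2

def decision_value(text):
    """Sum the weights of known tokens, plus the bias term (w.x + b)."""
    return sum(WEIGHTS.get(t, 0.0) for t in text.lower().split()) + BIAS

def is_health_page(text):
    """Pages with a positive decision value count as topic-relevant."""
    return decision_value(text) > 0

print(is_health_page("Diabetes Insulin Therapie"))
```

Because the decision function is just a weighted sum, relevance scoring stays cheap enough to run on every fetched page inside the crawl loop, which is what makes it usable for frontier prioritization.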
[18] Zowalla R, Wiesner M, Pfeifer D. Analyzing the German Health Web using a Focused Crawling Approach. HEC 2016; Health - Exploring Complexity: An Interdisciplinary Systems Approach, 28 August - 2 September 2016; Munich. Eur J Epidemiol (2016) 31:S1–S239. doi:10.1007/s10654-016-0183-1
The Battle of the Crawlers
Source: http://digitalpebble.blogspot.com/2017/01/the-battle-of-crawlers-apache-nutch-vs.html
Batch-Processing (Apache Nutch) versus Stream-Processing (StormCrawler)
Nutch fetched 6,038 pages per minute; StormCrawler fetched 9,792.26 pages per minute (≈ 62% higher throughput).
What is Apache Storm? (1) - Terminology
a distributed (real-time) stream processing framework
Taken from https://jansipke.nl/storm-in-pictures/ (28.11.18)
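Storm's core abstractions are spouts (stream sources) and bolts (stream transformers), wired together into a topology. A conceptual sketch of that dataflow in plain Python (real Storm topologies are usually written in Java against the org.apache.storm API; all names and data here are illustrative):

```python
def url_spout():
    """A spout emits an unbounded stream of tuples; here, a finite sample."""
    for url in ["http://a.example/", "http://b.example/seite"]:
        yield {"url": url}

def fetch_bolt(stream):
    """A bolt consumes tuples and emits transformed tuples downstream;
    the actual HTTP fetch is simulated with a fixed status code."""
    for t in stream:
        yield {"url": t["url"], "status": 200}

def index_bolt(stream):
    """Terminal bolt: collect results (stands in for indexing/storage)."""
    return list(stream)

# Chaining spout -> bolt -> bolt mirrors the edges of a Storm topology.
results = index_bolt(fetch_bolt(url_spout()))
print(results)
```

In Storm proper, each spout and bolt runs as parallel tasks across a cluster and tuples flow continuously, which is what gives a crawler like StormCrawler its high, steady throughput.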
What is Apache Storm? (2) - Architecture
fault tolerant & scalable
What is Apache Storm? (3) - Architecture
Take-home message…
Extracting information from unstructured (textual) data requires a lot of computational effort.
Processing & analysis software frameworks need to
(a) handle such computational load
(b) scale with the data (high throughput)
Often, such scalable software must be built from scratch, because existing frameworks cannot deal with (a) or (b).