Large scale net_archive_toke_eskildsen_iipc_workshop_2015

Toke Eskildsen
te@statsbiblioteket.dk IIPC Technical Training Workshop 2015 – Large Scale Net Archive Indexing - 1/26
Scaling Net Archive
Indexing & Search
IIPC Technical Training Workshop 2015
@TokeEskildsen
Low-level search guy
(boss says “System Architect”)

Scaling SolrCloud indexing
● CPU for analysis, bulk read & write for Solr
● Homogeneous shards (law of large numbers)
● Solr index update entry point might be a bottleneck (so use more entry points)
● Routing overhead
● Splitting and moving shards
● Schema changes might require parallel rebuild
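
The "use more entry points" advice can be sketched as a round-robin assignment of update batches to several Solr nodes. This is only an illustration: the node URLs and the function name `assign_batches` are hypothetical, and the actual sending of batches is left out.

```python
from itertools import cycle

def assign_batches(batches, entry_points):
    """Pair each update batch with an entry point in round-robin order."""
    targets = cycle(entry_points)
    return [(next(targets), batch) for batch in batches]

# Hypothetical entry point URLs, for illustration only
entry_points = ["http://solr1:8983/solr", "http://solr2:8983/solr"]
plan = assign_batches(["batch-a", "batch-b", "batch-c"], entry_points)
for target, batch in plan:
    print(target, batch)
```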

Static independent shards
● Easy scaling
– Predictable resource requirements
● Selective shard rebuilding
● Trivial backup
● Lower overall requirements
– Half the JVM heap requirements
– Single segment → higher performance
– Less disk cache competition
● Temporal locality
– Better disk cache utilization with few users
– Hot spot problem with more users
– Ranking suffers (in theory)
● No document-level updates! ~250M docs / 900GB shard

Static independent shards search
[Diagram: Searcher 1 in front of Shard 01, Shard 02 and Shard 03, with ZooKeeper]

Building static shards
● Not standard Solr
● Sample setup (distribution optional)
– 24 CPU cores (more would be nice)
– 1 Solr indexer @ 40 GB RAM
– 1 Archon tracking (W)ARC files
– 1 Arctika controlling webarchive-discovery (Tika)
– 40 webarchive-discovery (Tika) @ 1 GB RAM
– Final shards: 250M docs, 900GB, fully optimized
Archon + Arctika: https://github.com/netarchivesuite/netsearch
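
A back-of-the-envelope sketch of what the sample setup implies. The shard figures (250M docs, 900GB) and the heap sizes (40 workers @ 1 GB, 1 indexer @ 40 GB) are from the slide. Reading "GB" as decimal gigabytes is an assumption, and the collection size passed to `shards_needed` is a made-up example. The RAM sum ignores Archon, Arctika and OS disk cache.

```python
import math

DOCS_PER_SHARD = 250_000_000
SHARD_BYTES = 900 * 10**9  # assumes 900GB means decimal gigabytes

def bytes_per_doc():
    """Average index size per document in a finished shard."""
    return SHARD_BYTES / DOCS_PER_SHARD

def shards_needed(total_docs):
    """Number of static shards needed for a collection of total_docs."""
    return math.ceil(total_docs / DOCS_PER_SHARD)

def build_ram_gb(workers=40, worker_gb=1, indexer_gb=40):
    """Heap allocated to the Tika workers plus the Solr indexer."""
    return workers * worker_gb + indexer_gb

print(bytes_per_doc())               # index bytes per document
print(shards_needed(1_000_000_000))  # hypothetical 1 billion document collection
print(build_ram_gb())                # GB of heap for one build setup
```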

Static independent shards index
WAD = webarchive-discovery from UKWA: https://github.com/ukwa/webarchive-discovery
[Diagram components: Storage with ARC 1 ... ARC n; Archon handing out ARC-paths; two Arctika instances, each with WAD 1 ... WAD n; Indexer 1 with Shard 4 and Indexer 2 with Shard 5; Searcher 1 with Shard 1, Shard 2 and Shard 3]

Measuring search performance
● Mimic real world scenarios
– Unique queries
● preferably logged from production
– Warmed caches
– Concurrent searches (if relevant)
– Measured time, not reported QTime
● Capture setup data
– Index size, shard count, document count, free cache memory, sar logs
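
A minimal sketch of "measured time, not reported QTime": the clock wraps the whole call as the client sees it. `run_query` is a hypothetical callable that sends one query and waits for the full response. Cache warming and concurrency are not handled here.

```python
import statistics
import time

def measure(run_query, queries):
    """Return client-side wall-clock milliseconds for each query."""
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical: send request, read full response
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms

def summarize(timings_ms):
    """Mean, median and worst case for a list of timings."""
    return {
        "mean": statistics.mean(timings_ms),
        "median": statistics.median(timings_ms),
        "max": max(timings_ms),
    }

# Stand-in query function; a real test would replay logged production queries
timings = measure(lambda q: sum(range(1000)), ["query one", "query two", "query three"])
print(summarize(timings))
```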

Predicting scaling requirements
● All else is rarely equal
– Disk cache / index size ratio
– CPU cores / shard
– Slowest shard dictates total response time
● 3 or more measurement points
● Use 2 or more shards
● Visualize measurements
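
Why "slowest shard dictates total response time" matters when extrapolating: a small simulation, assuming (purely for illustration) that each shard answers in an exponentially distributed time with a mean of 100 ms. The distribution and the numbers are not measurements from the talk.

```python
import random

def distributed_time(shard_times_ms):
    """A distributed request finishes when its slowest shard has answered."""
    return max(shard_times_ms)

def simulate(shard_count, runs=2000, mean_ms=100.0, seed=42):
    """Average distributed response time over many simulated requests."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        times = [rng.expovariate(1.0 / mean_ms) for _ in range(shard_count)]
        total += distributed_time(times)
    return total / runs

for shards in (1, 2, 4, 8, 16):
    print(shards, round(simulate(shards)))
```

The average grows with the shard count even though no single shard got slower, which is one reason to measure with 2 or more shards and 3 or more points before predicting.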

SolrCloud distributed search
● Phase 1
– Tophits calculation (fast)
– Simple faceting (medium to slow)
● Phase 2
– Document resolving (fast)
– Facet fine count (medium to very slow)
● Coordination and merge overhead
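
The two phases for top hits can be sketched with toy data: phase 1 merges each shard's local (score, id) list into a global top-k, phase 2 resolves only those documents. This is a simplification for illustration, not Solr's implementation; faceting and fine counting are left out and all names are hypothetical.

```python
import heapq

def phase1(shard_hits, k):
    """Merge per-shard (score, doc_id) lists into the global top-k."""
    candidates = (
        (score, doc_id, shard)
        for shard, hits in shard_hits.items()
        for score, doc_id in hits
    )
    return heapq.nlargest(k, candidates)

def phase2(top, stores):
    """Resolve full documents, asking only the shards that hold a top hit."""
    return [stores[shard][doc_id] for _, doc_id, shard in top]

shard_hits = {
    "shard01": [(0.9, "a"), (0.2, "b")],
    "shard02": [(0.7, "c"), (0.5, "d")],
}
stores = {
    "shard01": {"a": "doc A", "b": "doc B"},
    "shard02": {"c": "doc C", "d": "doc D"},
}
top = phase1(shard_hits, k=2)
docs = phase2(top, stores)
print(docs)
```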

Interval popularity (aka long tail)

ms over time

hits, ms

log(hits), ms

Bucketed percentiles (candlesticks)
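
One possible way to produce the numbers behind such a chart: group (hits, ms) samples by the order of magnitude of the hit count and take percentiles per bucket. The bucketing rule, the nearest-rank percentile and the chosen percentiles are assumptions, not necessarily what was used for the original plot.

```python
import math
from collections import defaultdict

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def bucketed_percentiles(samples, ps=(25, 50, 75, 95)):
    """samples: (hits, ms) pairs with integer hits >= 0.

    Bucket b holds hit counts with b + 1 digits: 0 is 0-9, 1 is 10-99, ...
    """
    buckets = defaultdict(list)
    for hits, ms in samples:
        buckets[len(str(hits)) - 1].append(ms)
    return {
        bucket: {p: percentile(sorted(values), p) for p in ps}
        for bucket, values in sorted(buckets.items())
    }

samples = [(5, 10), (7, 20), (150, 100), (900, 300), (120, 200)]
result = bucketed_percentiles(samples)
print(result)
```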

Abstract search hardware
● IOPS
– Needed for concurrent users and/or many shards
● Latency
– 1 request = 1 thread / shard (lying a bit)
– Lower latency → more IOPS
● Tapes < Spinning drives < SSDs < RAM
– But the truth is in the mix
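
The "lower latency → more IOPS" relation as a rough estimate: with a fixed number of outstanding requests, throughput is about that number divided by the latency of one request. The latency figures below are illustrative assumptions, not measurements from the talk.

```python
def iops(outstanding_requests, latency_seconds):
    """Rough I/O operations per second for a given concurrency and latency."""
    return outstanding_requests / latency_seconds

# Assumed, illustrative latencies per random read
ASSUMED_LATENCY = {"spinning drive": 0.010, "ssd": 0.0001}

for medium, latency in ASSUMED_LATENCY.items():
    # 1 request = 1 thread per shard, so one outstanding read at a time
    print(medium, round(iops(1, latency)))
```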

Case study: Net Archive Search at State and University Library, Denmark

Standard request
● Free-text matching in 6 fields
● Phrase matching in 1 field
● Grouping on URL (not used in the tests)
● Faceting
– URL (~6b uniques, 7b references)
– Host & domain (millions of uniques, 7b references)
– 3 small ones (year, format, public suffix)
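
A request of this shape could be built roughly as follows. The parameter names (`q`, `defType`, `qf`, `pf`, `rows`, `facet`, `facet.field`, `facet.limit`) are standard Solr parameters. Using the edismax parser is an assumption, and every field name is a hypothetical placeholder rather than the real schema. Grouping is left out, as in the tests.

```python
from urllib.parse import urlencode

# All field names below are hypothetical placeholders, not the real schema
FREE_TEXT_FIELDS = ["title", "content", "url_text", "host_text", "author", "description"]
PHRASE_FIELD = "content"
FACET_FIELDS = ["url", "host", "domain", "year", "format", "public_suffix"]

def standard_request(query, rows=10, facet_limit=25):
    """Build the query string for one search request."""
    params = [
        ("q", query),
        ("defType", "edismax"),              # assumption: edismax query parser
        ("qf", " ".join(FREE_TEXT_FIELDS)),  # free-text matching in 6 fields
        ("pf", PHRASE_FIELD),                # phrase matching in 1 field
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.limit", str(facet_limit)),
    ]
    params += [("facet.field", field) for field in FACET_FIELDS]
    return urlencode(params)

query_string = standard_request("net archive")
print(query_string)
```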

Solr version & schema
● Solr 4.8.1 + SOLR-5894 patch (optional)
● Piggybacking on UKWA work
● DocValues on all large facet fields (essential)

Clever Solr config tweaks
This space intentionally left blank

CPU

Disk cache
RAM   %index   mean   median
110   0.49      658      141
 98   0.44     1004      170
 54   0.24     2164      361
 27   0.12     5620      913
  7   0.03     8546     3012
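
To make the trend explicit, the table rows can be turned into slowdown factors relative to the best-cached setup. The numbers are copied from the table. The slide does not state units, so the code only works with ratios.

```python
# (RAM, %index, mean, median) rows copied from the disk cache table
ROWS = [
    (110, 0.49, 658, 141),
    (98, 0.44, 1004, 170),
    (54, 0.24, 2164, 361),
    (27, 0.12, 5620, 913),
    (7, 0.03, 8546, 3012),
]

def slowdowns(rows):
    """Mean and median relative to the first (most cached) row."""
    _, _, base_mean, base_median = rows[0]
    return [
        (cached, mean / base_mean, median / base_median)
        for _, cached, mean, median in rows
    ]

factors = slowdowns(ROWS)
for cached, mean_factor, median_factor in factors:
    print(f"{cached:.2f} of index cached: mean x{mean_factor:.1f}, median x{median_factor:.1f}")
```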

Concurrent requests

Concurrent requests (less faceting)

Faceting impact mitigation
Sparse faceting: http://tokee.github.io/lucene-solr/

Fewer, smaller facets

● Measure thrice & visualise
● Common Solr rules of thumb are not always applicable at Net Archive scale
● Static shards make scaling easier
● SSDs work very well for us (22TB costs £7500)
● Full distributed faceting is doable but heavy
Danish Net Archive: http://netarkivet.dk/in-english/
More Solr tech talk: http://sbdevel.wordpress.com
