Toke Eskildsen
te@statsbiblioteket.dk IIPC Technical Training Workshop 2015 – Large Scale Net Archive Indexing - 1/26
Scaling Net Archive Indexing & Search
IIPC Technical Training Workshop 2015
@TokeEskildsen
Low-level search guy
(boss says “System Architect”)
Scaling SolrCloud indexing
● CPU for analysis, bulk read & write for Solr
● Homogeneous shards (law of large numbers)
● Solr index update entry point might be a bottleneck (so use more entry points)
● Routing overhead
● Splitting and moving shards
● Schema changes might require parallel rebuild
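The entry-point bullet above can be illustrated with a minimal sketch. It assumes hypothetical Solr update URLs and a caller-supplied `send` function (neither appears in the slides) and only shows batches being spread round-robin over several entry points; it is not the indexing code used by the author.

```python
from itertools import cycle


def distribute_batches(batches, entry_points, send):
    """Send each batch to the next entry point in round-robin order.

    `send(url, batch)` is a hypothetical callable that performs the actual
    update request; it is injected so the routing logic can be tested
    without a running Solr.
    """
    targets = cycle(entry_points)
    assignments = []
    for batch in batches:
        url = next(targets)
        send(url, batch)
        assignments.append(url)
    return assignments


if __name__ == "__main__":
    # Hypothetical entry points; real URLs depend on the cluster.
    nodes = ["http://solr1:8983/solr/update", "http://solr2:8983/solr/update"]
    sent = []
    distribute_batches([["doc1"], ["doc2"], ["doc3"]], nodes,
                       lambda url, batch: sent.append((url, batch)))
    print(sent)
```

Round-robin is only one possible policy; the point is that no single node receives every update request.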
Static independent shards
● Easy scaling
– Predictable resource requirements
● Selective shard rebuilding
● Trivial backup
● Lower overall requirements
– Half the JVM heap requirements
– Single segment → higher performance
– Less disk cache competition
● Temporal locality
– Better disk cache utilization with few users
– Hot spot problem with more users
– Ranking suffers (in theory)
● No document-level updates!
● ~250M docs / 900GB per shard
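Because static shards have predictable size, capacity planning reduces to simple arithmetic. The sketch below uses only the two figures from the slide (~250M documents and ~900GB per shard); the function names are hypothetical and decimal gigabytes are an assumption.

```python
import math

DOCS_PER_SHARD = 250_000_000    # from the slide: ~250M docs per shard
BYTES_PER_SHARD = 900 * 10**9   # from the slide: ~900GB per shard (decimal GB assumed)


def bytes_per_doc():
    """Average index bytes per document implied by the slide's figures."""
    return BYTES_PER_SHARD / DOCS_PER_SHARD


def shards_needed(total_docs):
    """Number of full-size static shards needed to hold `total_docs`."""
    return math.ceil(total_docs / DOCS_PER_SHARD)


if __name__ == "__main__":
    print(bytes_per_doc())                # average bytes of index per document
    print(shards_needed(1_000_000_000))   # shards for a hypothetical 1 billion documents
```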
Static independent shards search
[Diagram: Shard 01, Shard 02, Shard 03, Searcher 1, ZooKeeper]
Building static shards
● Not standard Solr
● Sample setup (distribution optional)
– 24 CPU cores (more would be nice)
– 1 Solr indexer @ 40 GB RAM
– 1 Archon tracking (W)ARC files
– 1 Arctika controlling webarchive-discovery (Tika)
– 40 webarchive-discovery (Tika) @ 1 GB RAM
– Final shards: 250M docs, 900GB, fully optimized
Archon + Arctika: https://github.com/netarchivesuite/netsearch
Static independent shards index
[Diagram: Storage (ARC 1 … ARC n), Archon, ARC-paths, two groups of Arctika 1 with WAD 1 … WAD n, Indexer 1 (Shard 4), Indexer 2 (Shard 5), Searcher 1 (Shard 1, Shard 2, Shard 3)]
WAD = webarchive-discovery from UKWA: https://github.com/ukwa/webarchive-discovery
Measuring search performance
● Mimic real-world scenarios
– Unique queries
● preferably logged from production
– Warmed caches
– Concurrent searches (if relevant)
– Measured time, not reported Qtime
● Capture setup data
– Index size, shard count, document count, free cache memory, sar logs
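"Measured time, not reported QTime" means timing the whole request from the client side. A minimal sketch, assuming a hypothetical `search(query)` callable that issues the request and returns once the full response has been received:

```python
import statistics
import time


def measure(queries, search):
    """Return the wall-clock time in milliseconds for each query.

    `search(query)` is a hypothetical callable; in a real test it would send
    the query to the searcher and read the complete response.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings_ms):
    """Summarize timings with the statistics used in the case study tables."""
    return {
        "count": len(timings_ms),
        "mean": statistics.mean(timings_ms),
        "median": statistics.median(timings_ms),
    }


if __name__ == "__main__":
    # Stand-in search function; replace with a real client call.
    print(summarize(measure(["q1", "q2", "q3"], lambda q: time.sleep(0.01))))
```

Feeding it unique queries logged from production, after warming the caches, matches the scenario the slide describes.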
Predicting scaling requirements
● All else is rarely equal
– Disk cache / index size ratio
– CPU cores / shard
– Slowest shard dictates total response time
● 3 or more measurement points
● Use 2 or more shards
● Visualize measurements
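Two of these points lend themselves to a small sketch: the slowest shard determines the distributed response time, and three or more measurement points are needed to see whether scaling is linear. The function names and sample numbers below are hypothetical; the x value of a measurement point could be index size or shard count.

```python
def total_response_ms(shard_times_ms):
    """A distributed request waits for every shard, so the slowest one dominates.

    Coordination and merge overhead are ignored in this simplification.
    """
    return max(shard_times_ms)


def slopes(points):
    """Slope between consecutive (x, response_ms) measurement points.

    With only two points there is a single slope and no way to tell whether
    the trend is linear; three or more points expose a change in slope.
    """
    pts = sorted(points)
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]


if __name__ == "__main__":
    print(total_response_ms([120, 95, 400]))          # illustrative shard times
    print(slopes([(1, 100), (2, 200), (4, 800)]))     # illustrative measurements
```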
SolrCloud distributed search
● Phase 1
– Tophits calculation (fast)
– Simple faceting (medium to slow)
● Phase 2
– Document resolving (fast)
– Facet fine count (medium to very slow)
● Coordination and merge overhead
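The two phases can be sketched for the top-hits part only (faceting is left out). This is a simplified illustration with hypothetical in-memory data structures, not SolrCloud's actual implementation: phase 1 collects scores and IDs from every shard and merges them, phase 2 fetches full documents only for the winners.

```python
import heapq


def phase1_top_hits(shard_results, rows):
    """Merge per-shard (score, doc_id) lists into the global top `rows` hits.

    `shard_results` maps a shard name to its local top hits.
    """
    candidates = [
        (score, doc_id, shard)
        for shard, hits in shard_results.items()
        for score, doc_id in hits
    ]
    return heapq.nlargest(rows, candidates)


def phase2_resolve(winners, stores):
    """Fetch the full document for each winner from the shard that owns it.

    `stores` maps a shard name to a dict of doc_id -> document.
    """
    return [stores[shard][doc_id] for _score, doc_id, shard in winners]


if __name__ == "__main__":
    results = {"shard01": [(0.9, "a"), (0.2, "b")], "shard02": [(0.5, "c")]}
    stores = {"shard01": {"a": "doc a", "b": "doc b"}, "shard02": {"c": "doc c"}}
    winners = phase1_top_hits(results, rows=2)
    print(phase2_resolve(winners, stores))
```

Only the documents that survive the merge are resolved, which is why phase 2 is fast for hits but can be expensive when facet counts must be refined across all shards.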
Interval popularity (aka long tail)
ms over time
hits, ms
log(hits), ms
Bucketed percentiles (candlesticks)
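One way to produce the numbers behind such a chart is sketched below. The slides do not say how the buckets were chosen; grouping by order of magnitude of the hit count and using nearest-rank percentiles are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def bucketed_percentiles(samples, percentiles=(25, 50, 75, 95)):
    """Group (hits, ms) samples by order of magnitude of hits.

    Bucket b holds queries with 10**b to 10**(b + 1) - 1 hits; queries with
    zero hits are placed in bucket 0. Hit counts must be non-negative integers.
    """
    buckets = defaultdict(list)
    for hits, ms in samples:
        bucket = len(str(hits)) - 1  # number of decimal digits minus one
        buckets[bucket].append(ms)
    return {
        bucket: {p: percentile(sorted(values), p) for p in percentiles}
        for bucket, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Illustrative samples only, not measurements from the talk.
    demo = [(5, 10), (7, 20), (50, 100), (99, 300), (12, 200)]
    print(bucketed_percentiles(demo))
```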
Abstract search hardware
● IOPS
– Needed for concurrent users and/or many shards
● Latency
– 1 request = 1 thread / shard (lying a bit)
– Lower latency → more IOPS
● Tapes < Spinning drives < SSDs < RAM
– But the truth is in the mix
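The "lower latency → more IOPS" relation can be written as a back-of-the-envelope formula. This sketch assumes each thread issues one blocking request at a time and that the storage serves all threads in parallel, so the result is an upper bound; the latency values in the example are placeholders, not measurements.

```python
def ops_per_second(latency_ms, threads=1):
    """Upper bound on I/O operations per second for blocking requests.

    Each thread completes at most 1000 / latency_ms operations per second.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return threads * 1000.0 / latency_ms


if __name__ == "__main__":
    # Placeholder latencies for illustration only.
    print(ops_per_second(10.0))             # one thread, 10 ms per operation
    print(ops_per_second(10.0, threads=4))  # four threads served in parallel
```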
Case study: Net Archive Search at State and University Library, Denmark
Standard request
● Free-text matching in 6 fields
● Phrase matching in 1 field
● Grouping on URL (not used in the tests)
● Faceting
– URL (~6b uniques, 7b references)
– Host & domain (millions of uniques, 7b references)
– 3 small ones (year, format, public suffix)
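A request of this shape could be expressed with standard Solr parameters roughly as follows. The slides do not show the actual request: the use of the edismax parser and every field name below are assumptions chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names; the real schema is not shown in the slides.
QUERY_FIELDS = ["title", "content", "url", "host", "description", "keywords"]
PHRASE_FIELD = "content"
FACET_FIELDS = ["url", "host", "domain", "year", "format", "public_suffix"]


def build_params(user_query, group=False):
    """Build a Solr query string: 6 query fields, 1 phrase field, 6 facets."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),          # assumed query parser
        ("qf", " ".join(QUERY_FIELDS)),  # free-text matching in 6 fields
        ("pf", PHRASE_FIELD),            # phrase matching in 1 field
        ("facet", "true"),
    ]
    params += [("facet.field", field) for field in FACET_FIELDS]
    if group:
        # Grouping on URL was not used in the tests, so it is off by default.
        params += [("group", "true"), ("group.field", "url")]
    return urlencode(params)


if __name__ == "__main__":
    print(parse_qs(build_params("net archive")))
```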
Solr version & schema
● Solr 4.8.1 + SOLR-5894 patch (optional)
● Piggybacking on UKWA work
● DocValues on all large facet fields (essential)
Clever Solr config tweaks
This space intentionally left blank
CPU
Disk cache
RAM   %index   mean   median
110   0.49      658      141
 98   0.44     1004      170
 54   0.24     2164      361
 27   0.12     5620      913
  7   0.03     8546     3012
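The effect of shrinking the disk cache is easier to see as a slowdown relative to the best-cached row. The sketch below copies the numbers from the table above; the mean and median columns are presumably response times, but the unit is not stated on the slide, so only ratios are computed.

```python
# (RAM, fraction of index cached, mean, median), copied from the table above.
ROWS = [
    (110, 0.49, 658, 141),
    (98, 0.44, 1004, 170),
    (54, 0.24, 2164, 361),
    (27, 0.12, 5620, 913),
    (7, 0.03, 8546, 3012),
]


def slowdowns(rows):
    """Return (RAM, mean ratio, median ratio) relative to the first row."""
    base_mean, base_median = rows[0][2], rows[0][3]
    return [
        (ram, mean / base_mean, median / base_median)
        for ram, _fraction, mean, median in rows
    ]


if __name__ == "__main__":
    for ram, mean_ratio, median_ratio in slowdowns(ROWS):
        print(f"RAM {ram:>3}: mean x{mean_ratio:.1f}, median x{median_ratio:.1f}")
```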
Concurrent requests
Concurrent requests (less faceting)
Faceting impact mitigation
Sparse faceting: http://tokee.github.io/lucene-solr/
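The general idea behind sparse counting, as far as it can be summarized here, is to remember which facet counters were touched so that extracting the top values and resetting the counters costs time proportional to the result set rather than to the number of unique values. The class below is a simplified illustration written for this text, not the code from the linked page.

```python
import heapq


class SparseCounter:
    """Facet counters that track which ordinals have been updated."""

    def __init__(self, unique_values):
        self.counts = [0] * unique_values  # one counter per unique facet value
        self.touched = []                  # ordinals with a non-zero count

    def increment(self, ordinal):
        if self.counts[ordinal] == 0:
            self.touched.append(ordinal)
        self.counts[ordinal] += 1

    def top(self, n):
        """Top n (count, ordinal) pairs, visiting only touched counters."""
        return heapq.nlargest(n, ((self.counts[o], o) for o in self.touched))

    def clear(self):
        """Reset only the touched counters so the structure can be reused."""
        for ordinal in self.touched:
            self.counts[ordinal] = 0
        self.touched = []


if __name__ == "__main__":
    counter = SparseCounter(unique_values=1000)
    for ordinal in [3, 3, 7, 3, 7, 9]:
        counter.increment(ordinal)
    print(counter.top(2))
    counter.clear()
```

With billions of unique URL values and a result set that touches only a few of them, skipping the untouched counters is where the saving would come from.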
Fewer, smaller facets
● Measure thrice & visualise
● Common Solr rules of thumb are not always applicable at Net Archive scale
● Static shards make scaling easier
● SSDs work very well for us (22TB costs £7500)
● Full distributed faceting is doable but heavy
Danish Net Archive: http://netarkivet.dk/in-english/
More Solr tech talk: http://sbdevel.wordpress.com

More Related Content

What's hot

Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Databricks
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
HostedbyConfluent
 
InfluxDB Live Product Training
InfluxDB Live Product TrainingInfluxDB Live Product Training
InfluxDB Live Product Training
InfluxData
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
StreamNative
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Time Series Tech Stack for the IoT Edge
Time Series Tech Stack for the IoT EdgeTime Series Tech Stack for the IoT Edge
Time Series Tech Stack for the IoT Edge
InfluxData
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
Paris Data Engineers !
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
DataWorks Summit
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Apache Apex
 
Centralised logging with ELK stack
Centralised logging with ELK stackCentralised logging with ELK stack
Centralised logging with ELK stack
Simon Hanmer
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
InfluxData
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
StreamNative
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
kbajda
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Log ingestion kafka -- impala using apex
Log ingestion   kafka -- impala using apexLog ingestion   kafka -- impala using apex
Log ingestion kafka -- impala using apex
Apache Apex
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Bowen Li
 
Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apex
Apache Apex
 

What's hot (20)

Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
InfluxDB Live Product Training
InfluxDB Live Product TrainingInfluxDB Live Product Training
InfluxDB Live Product Training
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Time Series Tech Stack for the IoT Edge
Time Series Tech Stack for the IoT EdgeTime Series Tech Stack for the IoT Edge
Time Series Tech Stack for the IoT Edge
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
 
Centralised logging with ELK stack
Centralised logging with ELK stackCentralised logging with ELK stack
Centralised logging with ELK stack
 
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
How Texas Instruments Uses InfluxDB to Uphold Product Standards and to Improv...
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Log ingestion kafka -- impala using apex
Log ingestion   kafka -- impala using apexLog ingestion   kafka -- impala using apex
Log ingestion kafka -- impala using apex
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
 
Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apex
 

Viewers also liked

Charlotte Local Marketing - Marketing for Veterinarians PowerPoint
Charlotte Local Marketing - Marketing for Veterinarians PowerPointCharlotte Local Marketing - Marketing for Veterinarians PowerPoint
Charlotte Local Marketing - Marketing for Veterinarians PowerPoint
Ron Blackwelder
 
Tool investment project - Методическое пособие инвестиционного проекта
Tool investment project - Методическое пособие инвестиционного проектаTool investment project - Методическое пособие инвестиционного проекта
Tool investment project - Методическое пособие инвестиционного проекта
PV Development
 
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
PV Development
 
Scanning Woes and War Stories
Scanning Woes and War StoriesScanning Woes and War Stories
Scanning Woes and War Stories
Toke Eskildsen
 
Определение цены деления измерительного прибора
Определение цены деления измерительного прибораОпределение цены деления измерительного прибора
Определение цены деления измерительного прибораvlasova-ta
 
Charlotte Local Marketing Marketing for Accountants PowerPoint
Charlotte Local Marketing Marketing for Accountants PowerPointCharlotte Local Marketing Marketing for Accountants PowerPoint
Charlotte Local Marketing Marketing for Accountants PowerPoint
Ron Blackwelder
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for Solr
Toke Eskildsen
 
Charlotte Local Marketing Marketing Budget PowerPoint
Charlotte Local Marketing Marketing Budget PowerPointCharlotte Local Marketing Marketing Budget PowerPoint
Charlotte Local Marketing Marketing Budget PowerPoint
Ron Blackwelder
 
Solr sparse faceting
Solr sparse facetingSolr sparse faceting
Solr sparse faceting
Toke Eskildsen
 
Предпроектная проработка возможного развития территории Конди
Предпроектная проработка возможного развития территории КондиПредпроектная проработка возможного развития территории Конди
Предпроектная проработка возможного развития территории Конди
PV Development
 
Charlotte Local Marketing Brand Establisher PowerPoint
Charlotte Local Marketing Brand Establisher PowerPointCharlotte Local Marketing Brand Establisher PowerPoint
Charlotte Local Marketing Brand Establisher PowerPoint
Ron Blackwelder
 
Report trade network_analysis
Report trade network_analysisReport trade network_analysis
Report trade network_analysis
PV Development
 
All American Pest Control Shares Tips For Avoiding Brown Recluse Bites
All American Pest Control Shares Tips For Avoiding Brown Recluse BitesAll American Pest Control Shares Tips For Avoiding Brown Recluse Bites
All American Pest Control Shares Tips For Avoiding Brown Recluse Bites
allamericanpestcontrol
 
Gabarito exercicios1
Gabarito exercicios1Gabarito exercicios1
Gabarito exercicios1
Carlos Alexandre Lemos
 
Wells Fargo Bank Statement
Wells Fargo Bank StatementWells Fargo Bank Statement
Wells Fargo Bank Statement
rickystutts
 
Концепция создание технопарка 2014 год
Концепция создание технопарка 2014 годКонцепция создание технопарка 2014 год
Концепция создание технопарка 2014 год
PV Development
 

Viewers also liked (16)

Charlotte Local Marketing - Marketing for Veterinarians PowerPoint
Charlotte Local Marketing - Marketing for Veterinarians PowerPointCharlotte Local Marketing - Marketing for Veterinarians PowerPoint
Charlotte Local Marketing - Marketing for Veterinarians PowerPoint
 
Tool investment project - Методическое пособие инвестиционного проекта
Tool investment project - Методическое пособие инвестиционного проектаTool investment project - Методическое пособие инвестиционного проекта
Tool investment project - Методическое пособие инвестиционного проекта
 
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
Toolkit development retail network 2010 - Методическое пособие. "Поиск, откры...
 
Scanning Woes and War Stories
Scanning Woes and War StoriesScanning Woes and War Stories
Scanning Woes and War Stories
 
Определение цены деления измерительного прибора
Определение цены деления измерительного прибораОпределение цены деления измерительного прибора
Определение цены деления измерительного прибора
 
Charlotte Local Marketing Marketing for Accountants PowerPoint
Charlotte Local Marketing Marketing for Accountants PowerPointCharlotte Local Marketing Marketing for Accountants PowerPoint
Charlotte Local Marketing Marketing for Accountants PowerPoint
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for Solr
 
Charlotte Local Marketing Marketing Budget PowerPoint
Charlotte Local Marketing Marketing Budget PowerPointCharlotte Local Marketing Marketing Budget PowerPoint
Charlotte Local Marketing Marketing Budget PowerPoint
 
Solr sparse faceting
Solr sparse facetingSolr sparse faceting
Solr sparse faceting
 
Предпроектная проработка возможного развития территории Конди
Предпроектная проработка возможного развития территории КондиПредпроектная проработка возможного развития территории Конди
Предпроектная проработка возможного развития территории Конди
 
Charlotte Local Marketing Brand Establisher PowerPoint
Charlotte Local Marketing Brand Establisher PowerPointCharlotte Local Marketing Brand Establisher PowerPoint
Charlotte Local Marketing Brand Establisher PowerPoint
 
Report trade network_analysis
Report trade network_analysisReport trade network_analysis
Report trade network_analysis
 
All American Pest Control Shares Tips For Avoiding Brown Recluse Bites
All American Pest Control Shares Tips For Avoiding Brown Recluse BitesAll American Pest Control Shares Tips For Avoiding Brown Recluse Bites
All American Pest Control Shares Tips For Avoiding Brown Recluse Bites
 
Gabarito exercicios1
Gabarito exercicios1Gabarito exercicios1
Gabarito exercicios1
 
Wells Fargo Bank Statement
Wells Fargo Bank StatementWells Fargo Bank Statement
Wells Fargo Bank Statement
 
Концепция создание технопарка 2014 год
Концепция создание технопарка 2014 годКонцепция создание технопарка 2014 год
Концепция создание технопарка 2014 год
 

Similar to Large scale net_archive_toke_eskildsen_iipc_workshop_2015

KSCOPE 2013: Exadata Consolidation Success Story
KSCOPE 2013: Exadata Consolidation Success StoryKSCOPE 2013: Exadata Consolidation Success Story
KSCOPE 2013: Exadata Consolidation Success Story
Kristofferson A
 
Christo kutrovsky oracle rac solving common scalability problems
Christo kutrovsky   oracle rac solving common scalability problemsChristo kutrovsky   oracle rac solving common scalability problems
Christo kutrovsky oracle rac solving common scalability problems
Christo Kutrovsky
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
Fabrizio Fortino
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes
 
The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQL
Datadog
 
Scalable Storage Configuration for the Physics Database Services
Scalable Storage Configuration for the Physics Database ServicesScalable Storage Configuration for the Physics Database Services
Scalable Storage Configuration for the Physics Database Services
mabessisindu
 
Storage Performance measurement using Tivoli productivity Center
Storage Performance measurement using Tivoli productivity CenterStorage Performance measurement using Tivoli productivity Center
Storage Performance measurement using Tivoli productivity Center
IBM Danmark
 
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
Tobias Koprowski
 
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
オラクルエンジニア通信
 
Exploring plsql new features best practices september 2013
Exploring plsql new features best practices   september 2013Exploring plsql new features best practices   september 2013
Exploring plsql new features best practices september 2013
Andrejs Vorobjovs
 
MySQL 5.6 Replication Webinar
MySQL 5.6 Replication WebinarMySQL 5.6 Replication Webinar
MySQL 5.6 Replication Webinar
Mark Swarbrick
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGate
Bobby Curtis
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
GlobalLogic Ukraine
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel QueriesChristo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky - Maximize Data Warehouse Performance with Parallel Queries
Christo Kutrovsky
 

Similar to Large scale net_archive_toke_eskildsen_iipc_workshop_2015 (20)

KSCOPE 2013: Exadata Consolidation Success Story
KSCOPE 2013: Exadata Consolidation Success StoryKSCOPE 2013: Exadata Consolidation Success Story
KSCOPE 2013: Exadata Consolidation Success Story
 
Christo kutrovsky oracle rac solving common scalability problems
Christo kutrovsky   oracle rac solving common scalability problemsChristo kutrovsky   oracle rac solving common scalability problems
Christo kutrovsky oracle rac solving common scalability problems
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
The Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQLThe Data Mullet: From all SQL to No SQL back to Some SQL
The Data Mullet: From all SQL to No SQL back to Some SQL
 
Scalable Storage Configuration for the Physics Database Services
Scalable Storage Configuration for the Physics Database ServicesScalable Storage Configuration for the Physics Database Services
Scalable Storage Configuration for the Physics Database Services
 
Storage Performance measurement using Tivoli productivity Center



Large scale net_archive_toke_eskildsen_iipc_workshop_2015

  • 1. Toke Eskildsen te@statsbiblioteket.dk IIPC Technical Training Workshop 2015 – Large Scale Net Archive Indexing - 1/26 Scaling Net Archive Indexing & Search IIPC Technical Training Workshop 2014 @TokeEskildsen Low-level search guy (boss says “System Architect”)
  • 2. Scaling SolrCloud indexing
      ● CPU for analysis, bulk read & write for Solr
      ● Homogeneous shards (law of large numbers)
      ● Solr index update entry point might be bottleneck (so use more entry points)
      ● Routing overhead
      ● Splitting and moving shards
      ● Schema changes might require parallel rebuild
  • 3. Static independent shards
      ● Easy scaling
        – Predictable resource requirements
      ● Selective shard rebuilding
      ● Trivial backup
      ● Lower overall requirements
        – Half the JVM heap requirements
        – Single segment → Higher performance
        – Less disk cache competition
      ● Temporal locality
        – Better disk cache utilization with few users
        – Hot spot problem with more users
        – Ranking suffers (in theory)
      ● No document-level updates!
      ~250M docs / 900GB shard
  • 4. Static independent shards search
      [Diagram: Searcher 1 with Shard 01, Shard 02, Shard 03; ZooKeeper]
  • 5. Building static shards
      ● Not standard Solr
      ● Sample setup (distribution optional)
        – 24 CPU cores (more would be nice)
        – 1 Solr indexer @ 40 GB RAM
        – 1 Archon tracking (W)ARC files
        – 1 Arctika controlling webarchive-discovery (Tika)
        – 40 webarchive-discovery (Tika) @ 1 GB RAM
        – Final shards: 250M docs, 900GB, fully optimized
      Archon + Arctika: https://github.com/netarchivesuite/netsearch
  • 6. Static independent shards index
      [Diagram: Storage holding ARC 1, ARC 2 ... ARC n; Archon supplying ARC-paths; two indexing pipelines, each with an Arctika and WAD 1, WAD 2 ... WAD n, feeding Indexer 1 (Shard 4) and Indexer 2 (Shard 5); Searcher 1 with Shard 1, Shard 2, Shard 3]
      WAD = webarchive-discovery from UKWA: https://github.com/ukwa/webarchive-discovery
  • 7. Measuring search performance
      ● Mimic real world scenarios
        – Unique queries
          ● preferably logged from production
        – Warmed caches
        – Concurrent searches (if relevant)
        – Measured time, not reported QTime
      ● Capture setup data
        – Index size, shard count, document count, free cache memory, sar logs
  • 8. Predicting scaling requirements
      ● All else is rarely equal
        – Disk cache / index size ratio
        – CPU cores / shard
        – Slowest shard dictates total response time
      ● 3 or more measurement points
      ● Use 2 or more shards
      ● Visualize measurements
  • 9. SolrCloud distributed search
      ● Phase 1
        – Tophits calculation (fast)
        – Simple faceting (medium to slow)
      ● Phase 2
        – Document resolving (fast)
        – Facet fine count (medium to very slow)
      ● Coordination and merge overhead
  • 10. Interval popularity (aka long tail)
  • 11. ms over time
  • 12. hits, ms
  • 13. log(hits), ms
  • 14. Bucketed percentiles (candlesticks)
  • 15. Abstract search hardware
      ● IOPS
        – Needed for concurrent users and/or many shards
      ● Latency
        – 1 request = 1 thread / shard (lying a bit)
        – Lower latency → more IOPS
      ● Tapes < Spinning drives < SSDs < RAM
        – But the truth is in the mix
  • 16. Case study: Net Archive Search at State and University Library, Denmark
  • 17. Standard request
      ● Free-text matching in 6 fields
      ● Phrase matching in 1 field
      ● Grouping on URL (not used in the tests)
      ● Faceting
        – URL (~6b uniques, 7b references)
        – Host & domain (millions of uniques, 7b references)
        – 3 small ones (year, format, public suffix)
  • 18. Solr version & schema
      ● Solr 4.8.1 + SOLR-5894 patch (optional)
      ● Piggy backing UKWA work
      ● DocValues on all large facet fields (essential)
  • 19. Clever Solr config tweaks
      This space intentionally left blank
  • 20. CPU
  • 21. Disk cache
      RAM (GB)   %index   mean (ms)   median (ms)
      110        0.49      658         141
       98        0.44     1004         170
       54        0.24     2164         361
       27        0.12     5620         913
        7        0.03     8546        3012
  • 22. Concurrent requests
  • 23. Concurrent requests (less faceting)
  • 24. Faceting impact mitigation
      Sparse faceting: http://tokee.github.io/lucene-solr/
  • 25. Fewer, smaller facets
  • 26. Summary
      ● Measure thrice & visualise
      ● Common Solr rules of thumb are not always applicable at Net Archive scale
      ● Static shards make scaling easier
      ● SSDs work very well for us (22TB costs £7500)
      ● Full distributed faceting is doable but heavy
      Danish Net Archive: http://netarkivet.dk/in-english/
      More Solr tech talk: http://sbdevel.wordpress.com
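
The measurement advice in slides 7 and 14 (time the request yourself instead of trusting the reported QTime, then plot bucketed percentiles) could be sketched roughly as below. This is a minimal Python illustration, not the tooling used for the talk: `timed`, `bucket_by_hits` and `summarize` are hypothetical names, `search` stands for any callable that runs one query and returns its hit count, and grouping by powers of ten of the hit count is an assumption about how the buckets were formed.

```python
import statistics
import time
from collections import defaultdict


def timed(search, query):
    """Run search(query) and return (hits, wall-clock milliseconds).

    The time is measured on the client side, so it includes network,
    merging and serialization, unlike the QTime Solr reports.
    """
    start = time.perf_counter()
    hits = search(query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return hits, elapsed_ms


def bucket_by_hits(samples):
    """Group (hits, ms) samples by order of magnitude of the hit count.

    Bucket 0 holds 1-9 hits, bucket 1 holds 10-99 hits, and so on.
    Queries with zero hits go into bucket -1.
    """
    buckets = defaultdict(list)
    for hits, ms in samples:
        key = len(str(int(hits))) - 1 if hits > 0 else -1
        buckets[key].append(ms)
    return dict(buckets)


def summarize(ms_values):
    """Return min, quartiles and max: the numbers one candlestick needs."""
    ordered = sorted(ms_values)
    if len(ordered) == 1:
        q1 = median = q3 = ordered[0]
    else:
        q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}
```

Feeding every logged production query through `timed`, bucketing the results and calling `summarize` per bucket gives one candlestick per hit-count interval, which makes outliers visible that a single mean would hide.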

Editor's Notes

  1. The two main drawbacks of running a very large SolrCloud are the memory overhead of handling both search & indexing for each shard, and having to rebuild the whole index in one go when a larger change to the schema is introduced.
  2. From a search perspective, searching the whole collection of shards is done exactly as it normally is with SolrCloud: ZooKeeper keeps track of shard status and searches are distributed.
  3. Static shard indexing is fully independent of SolrCloud (and ZooKeeper). Archon is a database, keeping track of the total amount of ARC-files and their status (non-indexed, being indexed, indexed, failed). There is only one Archon per logical collection (the whole Net Archive is one logical collection). Arctika is responsible for spawning WADs (webarchive-discovery from UKWA). It asks Archon for the path of an ARC in need of indexing and starts a new JVM, using WAD to analyze the single ARC and send the result into an indexer. It keeps doing this until the index has reached the wanted size; then it optimizes the index, which is then ready to be copied to the searcher. There is one Arctika per Solr Indexer. Each indexer is a plain Solr setup with a single shard.
  4. Single-shard Solr installations only use 1 phase to compute the result. The speed of phase 2 in a SolrCloud search varies from much faster than phase 1 to very much slower, primarily dictated by how much fine-counting is needed for faceting. The merge overhead is normally very small (a few milliseconds).
  5. All testing done on the full index with 25 shards @ 900GB, with a total of 7 billion documents.
  6. Hyper Threading does have an effect. The number of cores beyond 8 has little influence on response speed for smaller result sets. Note: Scaling down requirements and scaling up concurrent requests is likely to require more CPU power.
  7. Graph for 22TB index on 25 SSDs. The amount of disk cache was controlled by a tiny program, memeater (see https://github.com/tokee/memeater). Disk cache, or rather lack of disk cache, has a profound impact on Solr performance, for SSDs as well as spinning drives. The difference lies in the amount needed overall and the worst-case penalty for cache misses (not shown on the graph). At Statsbiblioteket we would like “normal” searches to return within 2 seconds. While we might get by with 100GB free memory for cache, 54GB would not be enough with current requirements.
  8. For full faceting, just 2 concurrent requests make it too slow for result sets larger than 10M documents, which is not a good thing. Mean performance is 2-3 QPS.
  9. Turning off URL faceting helps a great deal. Performance is on average ~10 QPS. Throughput rises from 1-4 concurrent threads, after which there is no further gain.
  10. no_facet (blue) is plain search without faceting. skip_facets (purple) is faceting without fine-counting (phase 1 only). This gives imprecise results. We have not yet measured how much that influences the result. sparse_facets (green) is sparse faceting (http://tokee.github.io/lucene-solr/), an optimization of standard Solr faceting. solr_facet (orange) is standard Solr faceting. Notice how the measurement points are missing for more than 10,000 hits; this is due to the slow speed and limited test time.
  11. Faceting without the URL field makes it feasible to use standard Solr faceting and nearly free to use faceting without fine-counting or sparse faceting, as long as the result set is below 100M documents.
  12. Although SSDs themselves are very inexpensive, building a single machine with 25 drives requires quite a controller. It might be better to buy smaller machines, maybe with 4-6 drives @ 1TB, 64GB RAM and 4 CPU cores? Start with a single machine and measure!
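
The "Disk cache / index size ratio" from slide 8 can be checked against the figures on slide 21 with a few lines of arithmetic. The sketch below assumes that the %index column is free cache RAM in GB divided by the total index size, and that the index size is the 25 shards @ 900GB mentioned in note 5; the function name `cache_percent` is made up for the illustration.

```python
# Assumption: total index size is 25 shards of 900 GB each (note 5).
SHARDS = 25
SHARD_SIZE_GB = 900
INDEX_SIZE_GB = SHARDS * SHARD_SIZE_GB  # 22500 GB, roughly the 22TB in note 7


def cache_percent(cache_ram_gb, index_size_gb=INDEX_SIZE_GB):
    """Free disk-cache RAM as a percentage of the index size, 2 decimals."""
    return round(cache_ram_gb / index_size_gb * 100.0, 2)


if __name__ == "__main__":
    # RAM values (GB) taken from the disk cache table on slide 21.
    for ram_gb in (110, 98, 54, 27, 7):
        print(f"{ram_gb:>4} GB cache = {cache_percent(ram_gb):.2f}% of index")
```

Under these assumptions the computed percentages match the %index column on slide 21, which suggests that the whole 22TB index was being served with at most about half a percent of its size available as disk cache.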