Scaling Box-Search:
Gearing up for petabyte scale
Human perception of
Realtime
13 milliseconds
Anthony Urbanowicz, Shubhro Roy
Staff Software Engineer
Box
Senior Software Engineer
Box
Agenda
/ Search @ Box
/ Existing infrastructure
/ Scaling issues
/ Key range partitioning
/ Shard growth
/ Shard allocation
/ Future work
/ Q&A
Cloud Content Management platform that enables you to
securely manage, share and access your content from anywhere
Search use cases
• Quick-search
• Full-text content search
• Metadata search
• Search API
Search index stats
/ Indexed files : Hundreds of billions
/ Replicated index size: 600+ TB
/ Files indexed per day: Hundreds of millions
/ Rate of index growth: ~2 TB / month
/ Projected index size in 5 years : ~1 petabyte
Index pipeline
[Diagram: WebApp/API, Events, Pub-Sub Queue, Search Workers, Representation Service, Offline Store (HBase), Offline ReIndex, and two clusters of Shard 0, Shard 1, ... Shard n]
Routing rule: fileId % numShards = shardId
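The modulo routing rule on this slide (fileId % numShards = shardId) can be sketched as follows. This is a minimal illustration, not the Box implementation; the function name and the sample file IDs are hypothetical.

```python
def shard_for_file(file_id: int, num_shards: int) -> int:
    """Return the shard index for a file under modulo routing."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return file_id % num_shards


# Hypothetical file IDs belonging to a single enterprise.
enterprise_files = [1001, 1002, 1003, 1004]

# Under modulo routing these files land on every shard of a 4-shard cluster,
# so a query scoped to this enterprise has to fan out to all shards.
shards_touched = {shard_for_file(f, 4) for f in enterprise_files}
```

Because the shard depends only on the file ID, one enterprise's documents spread across the whole cluster, which is the fan-out behaviour the later slides set out to reduce.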
Query Pipeline
[Diagram: WebApp/API, Query, Cerebro, Rate-Limiter, Circuit-Breaker, Re-Ranker, Federator, App-DB (MySQL), and two clusters of Shard 0, Shard 1, ... Shard n]
fq = <permissions filter>
Cluster Management
[Diagram: Cluster (Shard 0, Shard 1, ... Shard n), Search Worker, Cerebro, Zookeeper, Sops, Shard Health Monitor, Commands]
Major pain points
/ Queries bottlenecked by slowest shard
/ Single user/enterprise can overwhelm the whole cluster
/ Query latency increases as we add more machines
Solution: enterprise-based sharding
/ Natural grouping of documents
/ Deterministic sharding approach
/ Reduces query fan-out to just one shard in most cases
/ Better index compression
/ Better utilization of filter-cache
Solr composite-id routing
/ Document ID: ShardingKey/k!documentId
/ k: number of bits to use from the sharding key
/ ShardingKey = hash(enterpriseId)
/ Solr Query: q=solr&_route_=<enterpriseId>/k!
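The ID and query shapes on this slide can be assembled with plain string formatting. The sketch below only builds the strings; it does not call Solr, and the helper names and sample values are hypothetical.

```python
from typing import Dict


def composite_id(sharding_key: str, bits: int, document_id: str) -> str:
    """Build an ID of the form ShardingKey/k!documentId."""
    return f"{sharding_key}/{bits}!{document_id}"


def routed_query_params(q: str, sharding_key: str, bits: int) -> Dict[str, str]:
    """Build query parameters that carry the same routing prefix in _route_."""
    return {"q": q, "_route_": f"{sharding_key}/{bits}!"}


# Hypothetical values: sharding key "e42", 4 bits, file "f1001".
doc_id = composite_id("e42", 4, "f1001")
params = routed_query_params("solr", "e42", 4)
```

The same prefix appears in the document ID at index time and in `_route_` at query time, which is what lets a query reach only the shards that hold that prefix.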
Issues with composite-id routing
• Non-uniform shard sizes
• Need to generate and maintain mapping of enterpriseId → k
• Can’t handle new enterprises without offline rebuild
[Chart: Shard Document Counts; axis ticks run from 0 to 125 and from 200,000,000 to 1,800,000,000]
Dynamic key range partitioning
• Unique sharding key per document
• Sort sharding keys in byte order
• Partition rows into k buckets
• Each shard has n/k documents
[Diagram: key ranges of n/k keys each, mapped to logical shards 0 to 5]
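A minimal sketch of the partitioning step described on this slide: sort the sharding keys in byte order and cut them into k buckets of roughly n/k keys. The function name is hypothetical, and representing each bucket by its first key is an assumption made for illustration.

```python
from typing import List


def shard_boundaries(keys: List[bytes], k: int) -> List[bytes]:
    """Return the first key of each of k roughly equal buckets."""
    ordered = sorted(keys)  # Python compares bytes in byte order
    n = len(ordered)
    if k <= 0 or n < k:
        raise ValueError("need at least one key per bucket")
    # Bucket i starts at index floor(i * n / k), so sizes differ by at most 1.
    return [ordered[(i * n) // k] for i in range(k)]


# Twelve one-byte keys split into three buckets of four keys each.
sample_keys = [bytes([i]) for i in range(12)]
boundaries = shard_boundaries(sample_keys, 3)
```

Each boundary is the lowest key of a logical shard, so the sorted boundary list is all that routing needs to keep in memory.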
Sharding key
• h(enterpriseId): all documents of an enterprise are grouped together
• h(folderId): d-gap compression
• h(fileId): uniqueness
Layout: h(enterprise_id) (4 bytes) | h(folder_id) (4 bytes) | h(file_id) (8 bytes)
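The three-part key can be sketched as a 16-byte concatenation of fixed-width hashes. The slides do not name the hash function; BLAKE2b from Python's hashlib is used here purely as a stand-in, and the helper names are hypothetical.

```python
import hashlib


def _h(value: int, size: int) -> bytes:
    """Fixed-width hash of an integer ID (BLAKE2b is an assumption)."""
    return hashlib.blake2b(str(value).encode("ascii"), digest_size=size).digest()


def sharding_key(enterprise_id: int, folder_id: int, file_id: int) -> bytes:
    """Concatenate h(enterprise_id), h(folder_id), h(file_id) as 4 + 4 + 8 bytes."""
    return _h(enterprise_id, 4) + _h(folder_id, 4) + _h(file_id, 8)


# Two files in the same folder of the same enterprise (hypothetical IDs).
key_a = sharding_key(7, 300, 1001)
key_b = sharding_key(7, 300, 1002)
```

Keys from the same enterprise share their first 4 bytes and keys from the same folder share their first 8, so they sort next to each other in byte order.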
Shard routing
• Index routing: binary search of the sharding key over the shard boundaries
• Query routing: binary search + linear probing of h(enterpriseId) over the shard boundaries
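One way to read the routing scheme on this slide, as a sketch: shard boundaries are kept as a sorted list of each shard's lowest key, index routing is a single binary search, and query routing binary-searches on the enterprise prefix and then probes forward over any following shards that start inside the same prefix. The function names and sample boundaries are hypothetical, and the probing rule is an interpretation of "linear probing" rather than a confirmed detail.

```python
from bisect import bisect_right
from typing import List


def index_shard(boundaries: List[bytes], key: bytes) -> int:
    """Binary search for the shard whose key range contains the full key."""
    return max(0, bisect_right(boundaries, key) - 1)


def query_shards(boundaries: List[bytes], prefix: bytes) -> List[int]:
    """Find every shard that may hold keys starting with the enterprise prefix."""
    first = index_shard(boundaries, prefix)
    shards = [first]
    nxt = first + 1
    # Linear probe: a following shard also qualifies if its lowest key
    # still begins with the prefix.
    while nxt < len(boundaries) and boundaries[nxt].startswith(prefix):
        shards.append(nxt)
        nxt += 1
    return shards


# Hypothetical boundaries: one large enterprise (prefix ...05) spans three shards.
bounds = [b"", b"\x00\x00\x00\x05AAAA", b"\x00\x00\x00\x05MMMM", b"\x00\x00\x00\x09"]
big = query_shards(bounds, b"\x00\x00\x00\x05")
small = query_shards(bounds, b"\x00\x00\x00\x07")
```

A small enterprise resolves to a single shard, while a large one that straddles boundaries resolves to a short contiguous run of shards.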
Lumos architecture
HBase metastore
• Write ShardingKey to HBase at index time
• HBase stores keys in sorted byte order
• MR job scans HBase table to generate shard boundaries for a cluster
• Maintain shard boundaries per cluster in HBase
Shard routing
[Diagram: Index / Query Request, Search Worker / Cerebro, Lumos Controller, HBase, Zookeeper, Cluster (Shard 0, Shard 1, ... Shard n)]
No free lunch
Shard growth with live indexing
• Enterprises grow at different rates
• Machines have a max capacity of 3 TB
• Some shards exceed capacity while others have a lot to spare
[Chart: Shard growth in 2 weeks; file count per shard]
Solution: Shard spilling
• Spill new index requests to shard(s) with most capacity
• Optimal use of cluster capacity
• Automated load-balancing without any downtime
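The spilling idea can be sketched as a routing decision made per index request. The 3 TB capacity comes from the slides; the 90% spill threshold, the function name, and the sample sizes are assumptions for illustration only.

```python
from typing import Dict


def choose_shard(home: int, used_tb: Dict[int, float],
                 capacity_tb: float = 3.0, spill_threshold: float = 0.9) -> int:
    """Keep a document on its home shard unless that shard is nearly full."""
    if used_tb[home] < capacity_tb * spill_threshold:
        return home
    # Spill to the shard with the most free capacity (all shards are assumed
    # to have the same capacity, so that is the least-used shard).
    return min(used_tb, key=used_tb.get)


# Hypothetical index sizes in TB for a three-shard cluster.
usage = {0: 2.9, 1: 1.2, 2: 2.0}
spilled = choose_shard(0, usage)   # home shard 0 is over the threshold
kept = choose_shard(2, usage)      # home shard 2 still has room
```

A real system would also need to record where spilled documents went so that queries can find them; that bookkeeping is outside this sketch.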
Shard allocation
[Diagram: key ranges mapped to logical shards 0 to 5, which map to physical shards 0 to 5 on Node 1 and Node 2]
Shard allocation: Binpacking
[Diagram: key ranges mapped to logical shards 0 to 5, binpacked into physical shards 0 to 5 on Node 1 and Node 2]
Shard allocation: Binpacking
• Compute QPS per logical shard
• Minimize overall QPS at each machine
• NP-hard problem
• Greedy approximation: descending first fit
Minimize $\sum_{i} w_i x_{ij}$ for each bin $j$, where
$x_{ij} \in \{0, 1\}$ (1 if item $i$ is placed in bin $j$)
$w_i$ = weight of item $i$
$n$ = number of bins
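A minimal sketch of the greedy heuristic named on this slide, in its textbook first-fit decreasing form: sort logical shards by QPS in descending order and place each on the first machine that still has room. The per-machine QPS capacity, the function name, and the sample numbers are assumptions; the slides do not give them.

```python
from typing import Dict, List


def first_fit_decreasing(shard_qps: Dict[str, float], num_machines: int,
                         machine_capacity: float) -> List[List[str]]:
    """Assign shards to machines, heaviest first, first machine that fits."""
    machines: List[List[str]] = [[] for _ in range(num_machines)]
    loads = [0.0] * num_machines
    ordered = sorted(shard_qps.items(), key=lambda kv: kv[1], reverse=True)
    for shard, qps in ordered:
        for m in range(num_machines):
            if loads[m] + qps <= machine_capacity:
                machines[m].append(shard)
                loads[m] += qps
                break
        else:
            # No machine can take this shard within the assumed capacity.
            raise ValueError(f"shard {shard} does not fit on any machine")
    return machines


# Hypothetical per-shard QPS for six logical shards on two machines.
qps = {"s0": 50, "s1": 30, "s2": 40, "s3": 20, "s4": 10, "s5": 30}
placement = first_fit_decreasing(qps, num_machines=2, machine_capacity=100)
```

First fit packs machines tightly up to the capacity limit; if the goal is to even out load rather than respect a cap, placing each shard on the currently least-loaded machine is a common alternative greedy rule.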
Shard allocation: machine QPS
[Charts: machine QPS by hour (1 to 24), "Machine QPS (Random Shard -> Machine)" vs. "Machine QPS Binpacking"]
Future of Box Search
What is enterprise search
/ Currently, routing is "implicit"
/ Integrating key range partitioning into SolrCloud
/ Extend SolrCloud's DocRouter with the Lumos Controller for a new mode
/ Reuse SolrCloud's ZooKeeper (zk) coordinator
/ Create a Search Federation Service on top to:
• Classify and expand queries
• Route to multiple collections and verticals
• ML ranking of results
Enterprise search pros and cons
Data for training and features
/ Con: cannot crowdsource for human judgements due to customer document privacy
/ Con: users' document sets have little to no overlap, hence a lack of query-doc consensus
/ Pro: semi-closed loop, documents are previewed within our domain
/ Pro: signals beyond clicks, such as comment, edit, download and share
Relevance pyramid
Data for training and features
/ Search resultant events to train a relevance model
/ Train the below features using session logs and the above relevance model
/ Term-based features (BM25, classifications, query alterations, etc.)
/ Relationship-based features from Box Graph (non-search events)
Box graph
[Diagram, built up over several slides: explicit user interactions between Joe, Sue, Bob and File1, File2, File3, alongside the implicit user graph of Joe, Sue and Bob]
Shubhro Roy
sroy@box.com
Anthony Urbanowicz
aurbanowicz@box.com
We are hiring!
https://www.box.com/careers/engineering
Thank you
Q&A

More Related Content

What's hot

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
University of California, Santa Cruz
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Databricks
 
Biomedical Entity Linking - Introduction, approaches, challenges
Biomedical Entity Linking - Introduction, approaches, challengesBiomedical Entity Linking - Introduction, approaches, challenges
Biomedical Entity Linking - Introduction, approaches, challenges
Anja Pilz
 
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
DataWorks Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
lucenerevolution
 
Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4 Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4
TigerGraph
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Altinity Ltd
 

What's hot (10)

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Biomedical Entity Linking - Introduction, approaches, challenges
Biomedical Entity Linking - Introduction, approaches, challengesBiomedical Entity Linking - Introduction, approaches, challenges
Biomedical Entity Linking - Introduction, approaches, challenges
 
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
 
Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4 Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 

Similar to Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urbanowicz, Box

Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
David Pilato
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Codemotion
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
Sylvain Wallez
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Philip Filleul
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
Dylan Tong
 
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
Petter Skodvin-Hvammen
 
Improve Performance in Fast Search for SharePoint - Comperio
Improve Performance in Fast Search for SharePoint - ComperioImprove Performance in Fast Search for SharePoint - Comperio
Improve Performance in Fast Search for SharePoint - Comperio
Comperio - Search Matters.
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
C4Media
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Ceph Community
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
nkabra
 
Codemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labCodemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech lab
Ugo Landini
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
MongoDB
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
Jim Dowling
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
Swiss Big Data User Group
 
Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04marc_harrison
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Ali Hodroj
 

Similar to Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urbanowicz, Box (20)

Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoin...
 
Improve Performance in Fast Search for SharePoint - Comperio
Improve Performance in Fast Search for SharePoint - ComperioImprove Performance in Fast Search for SharePoint - Comperio
Improve Performance in Fast Search for SharePoint - Comperio
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Codemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labCodemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech lab
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
Lucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
Lucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
Lucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
Lucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
Lucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
Lucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Lucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
Lucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Lucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Lucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
Lucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
Lucidworks
 

More from Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...

Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urbanowicz, Box

  • 1. Scaling Box-Search: Gearing up for petabyte scale
  • 2. Human perception of realtime: 13 milliseconds
  • 3. Anthony Urbanowicz, Shubhro Roy / Staff Software Engineer, Box / Senior Software Engineer, Box
  • 4. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 5. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 6. Cloud Content Management platform that enables you to securely manage, share and access your content from anywhere
  • 7. Search use cases: Quick-search / Full-text content search / Metadata search / Search API
  • 8. Search index stats: Indexed files: hundreds of billions / Replicated index size: 600+ TB / Files indexed per day: hundreds of millions / Rate of index growth: ~2 TB per month / Projected index size in 5 years: ~1 petabyte
  • 9. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 10. Index pipeline (diagram). Components: WebApp/API, Events, Pub-Sub Queue, Search Workers, Representation Service, Offline Store (HBase), Offline ReIndex, Clusters (Shard 0, Shard 1, Shard n). Routing rule: fileId % numShards = shardId
  • 11. Query pipeline (diagram). Components: WebApp/API, Query, Cerebro, Rate-Limiter, Circuit-Breaker, Re-Ranker, Federator, App-DB (MySQL), Clusters (Shard 0, Shard 1, Shard n). Filter: fq = <permissions filter>
  • 12. Cluster management (diagram). Components: Search Worker, Cerebro, Zookeeper, Sops, Shard Health Monitor, Commands, Cluster (Shard 0, Shard 1, Shard n)
  • 13. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 14. Major pain points: Queries bottlenecked by slowest shard / Single user or enterprise can overwhelm the whole cluster / Query latency increases as we add more machines
  • 15. Solution: enterprise-based sharding: Natural grouping of documents / Deterministic sharding approach / Reduces query fan-out to just one shard in most cases / Better index compression / Better utilization of filter-cache
  • 16. Solr composite-id routing: Document ID format: ShardingKey/k!documentId / k: number of bits to use from the sharding key / ShardingKey = hash(enterpriseId) / Solr query: q=solr&_route_=<enterpriseId>/k!
  • 17. Issues with composite-id routing: Non-uniform shard sizes / Need to generate and maintain a mapping of enterpriseId → k / Can't handle new enterprises without an offline rebuild. [Chart: shard document counts; x-axis: shards 0 to 125; y-axis: document counts from 200,000,000 to 1,800,000,000]
  • 18. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 19. Dynamic key range partitioning: Unique sharding key per document / Sort sharding keys in byte order / Partition rows into k buckets / Each shard has n/k documents. [Diagram: key ranges of n/k keys each, mapped to Logical shard 0 through Logical shard 5]
  • 20. Sharding key: h(enterprise_id) (4 bytes), h(folder_id) (4 bytes), h(file_id) (8 bytes) / h(enterpriseId): all documents of an enterprise are grouped together / h(folderId): d-gap compression / h(fileId): uniqueness
  • 21. Shard routing: Index routing: binary search of the sharding key over the shard boundaries / Query routing: binary search + linear probing on h(enterpriseId)
  • 22. Lumos architecture
  • 23. HBase metastore: Write ShardingKey to HBase at index time / HBase stores keys in sorted byte order / MR job scans the HBase table to generate shard boundaries for a cluster / Maintain shard boundaries per cluster in HBase
  • 24. Shard routing (diagram). Components: Index / Query Request, Search Worker / Cerebro, Lumos Controller, HBase, Zookeeper, Cluster (Shard 0, Shard 1, Shard n)
  • 25. No free lunch
  • 26. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 27. Shard growth with live indexing: Enterprises grow at different rates / Machines have a max capacity of 3 TB / Some shards exceeded capacity while others had a lot to spare. [Chart: shard growth in 2 weeks; file count per shard]
  • 28. Solution: shard spilling: Spill new index requests to the shard(s) with the most capacity / Optimal use of cluster capacity / Automated load-balancing without any downtime
  • 29. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 30. Shard allocation (diagram): key ranges map to Logical shard 0 through Logical shard 5, which map to Physical shard 0 through Physical shard 5 on Node 1 and Node 2
  • 31. Shard allocation: binpacking (diagram): key ranges map to Logical shard 0 through Logical shard 5, which are binpacked as Physical shard 0 through Physical shard 5 onto Node 1 and Node 2
  • 32. Shard allocation: binpacking: Compute QPS per logical shard / Minimize overall QPS at each machine / NP-hard problem / Greedy approximation: descending first fit. Minimize $\sum_{i} w_i x_{ij}$ for each bin $j$, where $x_{ij} \in \{0,1\}$ indicates whether item $i$ is placed in bin $j$, $w_i$ = weight of item $i$, and $n$ = number of bins
  • 33. Shard allocation: machine QPS. [Charts: machine QPS by hour over 24 hours, random shard -> machine assignment versus binpacking]
  • 34. Agenda / Search @ Box / Existing infrastructure / Scaling issues / Key range partitioning / Shard growth / Shard allocation / Future work / Q&A
  • 35. Future of Box Search: What is enterprise search / Currently, routing is "implicit" / Integrating key range partitioning into SolrCloud / Extend SolrCloud's DocRouter with the Lumos Controller for a new mode / Reuse SolrCloud's zk coordinator / Create a Search Federation Service on top to: classify and expand queries; route to multiple collections and verticals; ML ranking of results
  • 36. Enterprise search pros and cons (data for training and features): Con: cannot crowdsource human judgements due to customer document privacy / Con: users' document sets have little to no overlap, hence a lack of query-doc consensus / Pro: semi-closed loop, documents are previewed within our domain / Pro: signals beyond clicks, such as comment, edit, download and share
  • 37. Relevance pyramid (data for training and features): Search resultant events to train a relevance model / Train the below features using session logs and the above relevance model / Term based features (BM25, classifications, query alterations, etc.) / Relationship based features from Box Graph (non-search events)
  • 38. Box graph (diagram): explicit user interactions (Joe, Sue, Bob on File1, File2, File3) and the implicit user graph (Joe, Sue, Bob)
  • 39. Box graph (diagram build, continued)
  • 40. Box graph (diagram build, continued)
  • 41. Box graph (diagram build, continued)
  • 42. Box graph (diagram build, continued)
  • 43. Box graph (diagram build, continued)
  • 44. Thank you / Q&A / Shubhro Roy: sroy@box.com / Anthony Urbanowicz: aurbanowicz@box.com / We are hiring! https://www.box.com/careers/engineering
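
The key range partitioning described in slides 19 to 21 can be sketched roughly as follows. This is a minimal illustration, not Box's Lumos implementation: the hash function (BLAKE2b truncated to 4 or 8 bytes), the function names, and the representation of shard boundaries as a sorted list of start keys are all assumptions made for the example. Only the 4 + 4 + 8 byte key layout, the byte-order sort, the n/k partitioning, and the "binary search" and "binary search + linear probing" routing come from the slides.

```python
import bisect
import hashlib


def h(value: int, size: int) -> bytes:
    """Hypothetical stand-in hash: `size` bytes of BLAKE2b over the decimal id."""
    return hashlib.blake2b(str(value).encode(), digest_size=size).digest()


def sharding_key(enterprise_id: int, folder_id: int, file_id: int) -> bytes:
    """16-byte key: h(enterprise_id) | h(folder_id) | h(file_id)."""
    return h(enterprise_id, 4) + h(folder_id, 4) + h(file_id, 8)


def shard_starts(sorted_keys: list[bytes], k: int) -> list[bytes]:
    """Split n byte-sorted keys into k ranges of about n/k keys each.

    Returns the first key of every range; shard i covers
    [starts[i], starts[i + 1]). Assumes len(sorted_keys) >= k.
    """
    n = len(sorted_keys)
    return [b""] + [sorted_keys[i * n // k] for i in range(1, k)]


def route_index(starts: list[bytes], key: bytes) -> int:
    """Index routing: binary search for the range that contains the key."""
    return bisect.bisect_right(starts, key) - 1


def route_query(starts: list[bytes], enterprise_prefix: bytes) -> list[int]:
    """Query routing: binary search for the first shard that can hold the
    enterprise prefix, then probe forward while later shards still overlap it."""
    lowest = enterprise_prefix + b"\x00" * 12   # smallest key with this prefix
    highest = enterprise_prefix + b"\xff" * 12  # largest key with this prefix
    shard = route_index(starts, lowest)
    shards = [shard]
    while shard + 1 < len(starts) and starts[shard + 1] <= highest:
        shard += 1
        shards.append(shard)
    return shards
```

Because the enterprise hash is the leading 4 bytes, all documents of one enterprise sort next to each other, so a query usually touches a single shard and only fans out when a boundary falls inside that enterprise's key range.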

Editor's Notes

  1. Welcome to our talk. Interesting article: Human Perception of Realtime
  2. Visual stimuli: 13 milliseconds. > 100 ms is NOT instantaneous. > 2 s: user has disengaged. Latency is critical for search. NOT 10 blue links. Additional processing: query expansion, relevance ranking, collaborative filtering, etc.
  3. Before we begin: How many of you have used Box?
  4. What you will learn: typical challenges with distributed search infra, and how to address them at scale
  5. Collaboration and content management platform. Create and share content securely. Content discovery: find relevant content. 300K+ enterprises. Millions of users.
  6. Type-ahead style search. Snippeting. Highlighting. Date / author / image recognition / video transcripts. API: Notes / Mobile / Platform
  7. > 60 billion files. > 100 million files per day. Need to take a long hard look at our infrastructure to ensure it scales.
  8. Index process. Cluster => Machine => Shard. Different clusters in different DCs => DR & HA. Offline store + reindex.
  9. <FASTER> Machines go down all the time. Systems should be agnostic to such failures.
  10. Solution: reduce query fan-out. Shard by some natural grouping of documents.
  11. Shard by some natural grouping of documents. User access is restricted to the enterprise, so only query the eid shard. Same eid → similar content / same language → index compression. Similar fqs → warmed-up cache.
  12. Can't have a 1:1 mapping. Large enterprises will blow up shards.
  13. A few massively large enterprises. Need more control over the distribution of documents. New enterprise issue. Better to be dynamic than predictive.
  14. Introduce key range partitioning
  15. Solves new enterprise case
  16. Entire sharding logic encapsulated by Lumos. HBase: metastore. Zookeeper: coordination.
  17. Reasons for using HBase: already used, sorted keys, MR integrations
  18. Old way: uniform shard growth. New way: enterprises grow differently.
  19. We cannot add new machines when other machines have more capacity
  20. Increase in query fan-out → optimal utilization of space
  21. Solving the second problem: shard-to-node allocation
  22. Naive approach. Some enterprises use Box a lot more than others. Hence some shards have a lot more query load than others. This causes hot spotting.
  23. Binpacking: a type of scheduling to level out load over all nodes
  24. There are optimal and non-optimal algorithms. We used a greedy algorithm, descending first fit, which zips up the lowest-QPS shards with the highest-QPS shards. How does it work?
  25. Example of applied binpacking that decreases spikes.
  26. Overview of what will be added to the infra. Add all that was said as a mode in SolrCloud so you all can scale too. Beyond that: adding relevance to search. To affect relevance we need to have data (transition).
  27. Difference between web search and enterprise search at Box. Transition to how to maximize the effect of user feedback on relevance.
  28. Top down. No labeled sets; use a relevance model trained on end-to-end events to extract good clicks. Some results are good even without an event, as the user might have just wanted a piece of info from the document. Use this to extract query-doc pairs and train all non-search-event-based features. These are BM25, document-query classifications, and the Box graph.
  29. Explicit interactions create implicit connections between the users.
  30. If two users interact on the same content, they have common interests
  31. These interactions can lead to larger graphs
  32. Strength of the connection represents collaboration strength
  33. The more common files you work on, the stronger the collaboration strength is.
  34. These are what we are working on and commit to presenting in 2019.
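
The "descending first fit" greedy approximation named on slide 32 and in note 24 can be illustrated with the textbook first-fit-decreasing heuristic. This is a sketch under stated assumptions, not the allocator used at Box: the function name, the shard names, and the per-node QPS budget `node_capacity` are hypothetical, and the talk's variant works with a fixed set of machines rather than opening new bins on demand.

```python
def first_fit_decreasing(shard_qps: dict[str, float],
                         node_capacity: float) -> list[list[str]]:
    """Assign shards to nodes with first-fit decreasing.

    Shards are visited from highest to lowest QPS; each one goes to the
    first node whose load stays within `node_capacity` (an assumed per-node
    QPS budget). A new node is opened when no existing node has room.
    """
    nodes: list[list[str]] = []
    loads: list[float] = []
    # Sort by QPS descending; break ties by name for a deterministic result.
    ordered = sorted(shard_qps.items(), key=lambda kv: (-kv[1], kv[0]))
    for shard, qps in ordered:
        for i, load in enumerate(loads):
            if load + qps <= node_capacity:
                nodes[i].append(shard)
                loads[i] += qps
                break
        else:
            # No existing node can take this shard.
            nodes.append([shard])
            loads.append(qps)
    return nodes
```

Placing the heaviest shards first leaves the light shards to fill the remaining gaps, which is what pairs low-QPS shards with high-QPS shards on the same machine and flattens per-machine load.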