Search Architecture at Evernote 
Not Your Typical Big Data Problem 
CHRISTIAN KOHLSCHÜTTER 
Sr. Search Researcher 
Augmented Intelligence @ Evernote
We are the workspace.
Write · Collect · Find · Present
Serving 100+ Million Users Worldwide 
• 559 Shards (200k users per shard), Linux/Tomcat/MySQL 
• 3.2 PB WebDAV-based Storage 
• 224 TB SSD capacity for System, MySQL and Lucene 
• 3.1 Billion Notes stored, 3.8 Bn Notes ever created 
• 115 Million Notes created or edited last week 
• 26 Million API calls to Context last week 
• 1 Lucene index per user
Evernote’s Three Laws of Data Protection 
• Your Data is Yours 
• Your Data is Protected 
• Your Data is Portable 
We are not a “big data” company and do not try to make 
money from your content.
Technical Debt 
• I/O over Lucene 2.9 indexes became a bottleneck 
• Code was woven into our “NoteStore” platform 
• Index changes had to be backwards-compatible 
• Complex re-indexing would require taking down a shard 
• Needed to rethink the entire architecture, but keep public API 
• Make search faster vs. Make us move faster
From Lucene 2.9 to 4.x and beyond 
• Large refactoring of search code 
• Lucene is no longer a direct dependency of “NoteStore” 
• Design-by-Contract 
• Can now run multiple Lucene versions concurrently in one VM 
• … and one specific version / schema per user 
• Migrated all users to Lucene 4.5, avg. downtime/user < 1 min
Separate the What from the How
Separation of Concerns 
(component diagram) 
API: NoteStore, UserIndexManager, UserIndexFactory, UserIndex 
Implementation: Lucene29UserIndexImpl, Lucene4UserIndexImpl, BenchmarkingUserIndex, CachingUserIndex, ... 
(a minimal sketch of this API/implementation split follows)
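A minimal sketch, in Java, of how such an API/implementation split can look. The interface and class names mirror the diagram, but the method signatures are illustrative assumptions, not Evernote's actual contract; the point is that NoteStore only ever sees the UserIndex interface, while decorators such as BenchmarkingUserIndex wrap any implementation.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** Contract seen by NoteStore; names mirror the diagram, signatures are assumptions. */
interface UserIndex extends Closeable {
    void indexNote(String noteGuid, String content) throws IOException;
    List<String> search(String query) throws IOException;
    int indexVersion();
}

/** Decorator example: wraps any UserIndex implementation and times each call. */
class BenchmarkingUserIndex implements UserIndex {
    private final UserIndex delegate;

    BenchmarkingUserIndex(UserIndex delegate) {
        this.delegate = delegate;
    }

    @Override
    public void indexNote(String noteGuid, String content) throws IOException {
        long start = System.nanoTime();
        delegate.indexNote(noteGuid, content);
        System.out.printf("indexNote took %d µs%n", (System.nanoTime() - start) / 1000);
    }

    @Override
    public List<String> search(String query) throws IOException {
        long start = System.nanoTime();
        List<String> hits = delegate.search(query);
        System.out.printf("search took %d µs%n", (System.nanoTime() - start) / 1000);
        return hits;
    }

    @Override
    public int indexVersion() {
        return delegate.indexVersion();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

Lucene29UserIndexImpl and Lucene4UserIndexImpl would implement the same interface against their respective Lucene versions, which is what allows them to coexist behind the factory.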
Hide Lucene behind ClassLoaders 
• One Maven artifact per major Lucene version, 
build profiles for code-reuse between minor updates 
• Code is packaged with dependencies into one common fat-jar with prefixes for each 
implementation: 
- lucene29/org/apache/lucene/... 
lucene29/com/evernote/search/lucene2/… 
- lucene43/org/apache/lucene/... 
lucene43/com/evernote/search/lucene4/… 
- lucene45/org/apache/lucene/… 
lucene45/com/evernote/search/lucene4/… 
• A ResourcePrefixClassLoader, called from outside code, strips the prefix and uses the fat jar as the only dependency (sketched below)
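The slides only name the ResourcePrefixClassLoader, so the following is a hedged reconstruction of the idea, not Evernote's code: a ClassLoader that redirects every class and resource lookup to "<prefix>" + path (e.g. "lucene45/org/apache/lucene/...") inside the fat jar.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/** Hypothetical sketch: resolves classes/resources under a per-version prefix. */
public class ResourcePrefixClassLoader extends ClassLoader {
    private final String prefix; // e.g. "lucene45/"

    public ResourcePrefixClassLoader(String prefix, ClassLoader parent) {
        super(parent);
        this.prefix = prefix;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // "org.apache.lucene.Foo" -> "lucene45/org/apache/lucene/Foo.class"
        String path = prefix + name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            byte[] bytes = out.toByteArray();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }

    @Override
    protected URL findResource(String name) {
        return getParent().getResource(prefix + name);
    }
}
```

Because the parent loader never sees the un-prefixed Lucene packages, each prefix gets its own isolated copy of org.apache.lucene, which is what lets several Lucene versions run in one VM.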
New Index Structure 
• Each user’s index now ships with a properties file that describes its internal structure, such as index type and version, so the code can handle per-index differences (see the sketch after this list) 
• Changes to the index schema? Just increase the index version 
and handle the rest in code 
• Automatically trigger re-indexing if necessary
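A minimal sketch of such a descriptor check, assuming a file named index.properties with keys index.type and index.version; both names are illustrative, not Evernote's actual format.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Sketch of a per-user index descriptor; file name and keys are assumptions. */
public class IndexDescriptor {
    private final String type;  // e.g. "lucene45"
    private final int version;  // schema version; bump it to force a re-index

    private IndexDescriptor(String type, int version) {
        this.type = type;
        this.version = version;
    }

    static IndexDescriptor load(Path indexDir) throws IOException {
        Properties p = new Properties();
        try (Reader r = Files.newBufferedReader(indexDir.resolve("index.properties"))) {
            p.load(r);
        }
        return new IndexDescriptor(
                p.getProperty("index.type", "lucene29"),
                Integer.parseInt(p.getProperty("index.version", "1")));
    }

    /** Re-index if the on-disk schema lags behind what the code expects. */
    boolean needsReindex(String targetType, int targetVersion) {
        return !type.equals(targetType) || version < targetVersion;
    }
}
```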
Index Auto-Migration 
• Target Default Index Implementation centrally set by DevOps 
• Triggered upon UserIndex access 
• UserIndex facade determines whether re-index is necessary 
• “Cruise Control” automates off-peak access (see the sketch below) 
[chart: “# Threads” used for auto-migration over time]
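One way the facade could trigger a rebuild on access while keeping the heavy work off-peak, reusing the IndexDescriptor sketch above. The thread count, time window, and class names are assumptions for illustration; the real “Cruise Control” adapts the amount of concurrent work dynamically.

```java
import java.time.LocalTime;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch of auto-migration triggered on UserIndex access. */
public class MigratingIndexFacade {
    // Bounded pool so migrations cannot starve regular search traffic.
    private final ExecutorService reindexPool = Executors.newFixedThreadPool(4);

    public void onUserIndexAccess(String userId, IndexDescriptor descriptor,
                                  String targetType, int targetVersion) {
        if (descriptor.needsReindex(targetType, targetVersion) && isOffPeak()) {
            reindexPool.submit(() -> rebuildIndex(userId, targetType, targetVersion));
        }
        // Searches keep being served from the old index until the rebuild swaps in.
    }

    private boolean isOffPeak() {
        LocalTime now = LocalTime.now();
        return now.isAfter(LocalTime.of(1, 0)) && now.isBefore(LocalTime.of(5, 0));
    }

    private void rebuildIndex(String userId, String targetType, int targetVersion) {
        // Re-read the user's notes from the primary store, write a fresh index
        // in the target format, then atomically switch the facade over. (Omitted.)
    }
}
```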
Phase 1: Migration to Lucene 4 
• Changes in Disk I/O (CPU usage correlates): 
- overall: -81% 
- searchRelatedNotes: -87% 
- keyword-based search: -96% 
Saves TBs of I/O
Phase 2: Add Compression 
• User index sizes and access patterns are skewed 
• Optimize large accounts 
• Directory-level compression 
• Compress segment files, invisible to the IndexReader 
• Only when re-indexing / every 3 months 
• In-memory Caching (see the LRU sketch below)
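A minimal sketch of a size-bounded LRU cache in the spirit of the in-memory caching above, assuming decompressed blocks keyed by file name plus block offset; the production cache is more elaborate (it is the 100 MB cache measured in the results slide).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Size-bounded LRU cache for decompressed blocks; sizes/keys are illustrative. */
public class BlockCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true turns the LinkedHashMap into an LRU: get() moves entries to the tail.
    private final LinkedHashMap<String, byte[]> blocks =
            new LinkedHashMap<>(1024, 0.75f, true);

    public BlockCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public synchronized byte[] get(String fileAndBlock) {
        return blocks.get(fileAndBlock);
    }

    public synchronized void put(String fileAndBlock, byte[] decompressed) {
        byte[] previous = blocks.put(fileAndBlock, decompressed);
        if (previous != null) {
            currentBytes -= previous.length;
        }
        currentBytes += decompressed.length;
        // Evict least-recently-used blocks until we are back under budget.
        Iterator<Map.Entry<String, byte[]>> it = blocks.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            if (eldest.getKey().equals(fileAndBlock)) {
                continue; // never evict the block we just inserted
            }
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}
```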
LuceneTransform 
• https://code.google.com/p/lucenetransform by Mitja Lenič 
• We ported it to Lucene 4.5 (now available upstream for 4.9) 
• Improved LRU caching, added LZ4/Snappy compression 
• We will contribute our changes soon
OverlayDirectory 
on disk:          visible to IndexReader: 
_23.cfe           _23.cfe 
_23.si            _23.si 
c$_23.cfs         _23.cfs 
segments.gen      segments.gen 
segments_2        segments_2
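A hedged sketch of the overlay idea written against Lucene's FilterDirectory (available in later 4.x releases). The c$ prefix convention is taken from the slide; the actual decompression happens inside the returned IndexInput, and the remaining Directory methods (fileLength, deleteFile, etc.) would need the same name mapping and are omitted here.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Sketch: hides the "c$" on-disk naming of compressed segment files from the IndexReader. */
public class OverlayDirectory extends FilterDirectory {
    private static final String COMPRESSED_PREFIX = "c$";

    public OverlayDirectory(Directory delegate) {
        super(delegate);
    }

    @Override
    public String[] listAll() throws IOException {
        // Present "c$_23.cfs" to the IndexReader as "_23.cfs".
        return Arrays.stream(in.listAll())
                .map(name -> name.startsWith(COMPRESSED_PREFIX)
                        ? name.substring(COMPRESSED_PREFIX.length())
                        : name)
                .toArray(String[]::new);
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        String compressed = COMPRESSED_PREFIX + name;
        if (Arrays.asList(in.listAll()).contains(compressed)) {
            // In the real implementation this input decompresses blocks on the fly,
            // consulting the in-memory block cache first.
            return in.openInput(compressed, context);
        }
        return in.openInput(name, context);
    }
}
```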
Results 
• Compressed the largest 5% of all indexes using LZ4 
• 1.9 TB index space saved 
• 100 MB LRU Cache hit rate: 79% on avg (67% — 93%) 
• Saved 0.5 PB disk reads/week 
• The cache works well enough that we may switch to a stronger (slower) compression algorithm 
and apply compression to more users 
Saves PBs of I/O
Bugs, Bugs, Bugs :-) 
• We’d been warned about known issues: 
- “VInt bug” 
- “background merge hit exception” 
- JVM segfaults 
• …and then these happened, too: 
- SPI / ContextClassLoaders … LUCENE-4713 
- Deadlocks / over-optimistic locking 
- Unclosed resources / too many open file handles => HousekeepingDirectory 
- Issues with the FieldCache singleton => LUCENE-831, LUCENE-2133, … 
- … 
• UserIndex tracks a “broken” state and allows self-healing (rebuild); see the sketch below
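A sketch of what such a broken-state guard can look like; the class and method names are assumptions, not Evernote's code. The key point is that notes remain the source of truth, so a corrupt per-user index can always be rebuilt.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative guard: any unexpected failure marks the index broken and schedules a rebuild. */
public abstract class SelfHealingUserIndex {
    private final AtomicBoolean broken = new AtomicBoolean(false);

    protected <T> T guarded(IndexCall<T> call) throws IOException {
        if (broken.get()) {
            throw new IOException("index marked broken; rebuild pending");
        }
        try {
            return call.run();
        } catch (IOException | RuntimeException e) {
            if (broken.compareAndSet(false, true)) {
                scheduleRebuild(); // notes are the source of truth, so a rebuild loses nothing
            }
            throw e;
        }
    }

    protected abstract void scheduleRebuild();

    @FunctionalInterface
    protected interface IndexCall<T> {
        T run() throws IOException;
    }
}
```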
Conclusion 
• Design-by-Contract, Separation of Concerns 
• Per-user Search Implementation / Multiple Lucene versions 
• Migrated 60M users, without noticeable downtime 
• Migration allowed index changes, saves TBs of disk I/O 
• Block-level Index Compression, saves PBs of disk I/O 
• This is just the beginning.
Thank you 
christian@evernote.com
We’re hiring 
evernote.com/careers