Search Architecture at Evernote 
Not Your Typical Big Data Problem 
CHRISTIAN KOHLSCHÜTTER 
Sr. Search Researcher 
Augmented Intelligence @ Evernote
We are the workspace.
Write · Collect · Find · Present
Serving 100+ Million Users Worldwide 
• 559 Shards (200k users per shard), Linux/Tomcat/MySQL 
• 3.2 PB WebDAV-based Storage 
• 224 TB SSD capacity for System, MySQL and Lucene 
• 3.1 Billion Notes stored, 3.8 Bn Notes ever created 
• 115 Million Notes created or edited last week 
• 26 Million API calls to Context last week 
• 1 Lucene index per user
Evernote’s Three Laws of Data Protection 
• Your Data is Yours 
• Your Data is Protected 
• Your Data is Portable 
We are not a “big data” company and do not try to make 
money from your content.
Technical Debt 
• I/O over Lucene 2.9 indexes became a bottleneck 
• Code was woven into our “NoteStore” platform 
• Index changes had to be backwards-compatible 
• Complex re-indexing would require taking down a shard 
• Needed to rethink the entire architecture, but keep public API 
• Make search faster vs. Make us move faster
From Lucene 2.9 to 4.x and beyond 
• Large refactoring of search code 
• Lucene is no longer a direct dependency of “NoteStore” 
• Design-by-Contract 
• Can now run multiple Lucene versions concurrently in one VM 
• … and one specific version / schema per user 
• Migrated all users to Lucene 4.5, avg. downtime/user < 1 min
Separate the What from the How
Separation of Concerns 
(component diagram) 
API: NoteStore, UserIndexManager, UserIndexFactory, UserIndex 
Implementation: Lucene29UserIndexImpl, Lucene4UserIndexImpl, BenchmarkingUserIndex, CachingUserIndex, ... 
(a minimal sketch of this API/implementation split follows)
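A minimal sketch, in Java, of how such an API/implementation split can look. The interface and class names mirror the diagram, but the method signatures are illustrative assumptions, not Evernote's actual contract; the point is that NoteStore only ever sees the UserIndex interface, while decorators such as BenchmarkingUserIndex wrap any implementation.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** Contract seen by NoteStore; names mirror the diagram, signatures are assumptions. */
interface UserIndex extends Closeable {
    void indexNote(String noteGuid, String content) throws IOException;
    List<String> search(String query) throws IOException;
    int indexVersion();
}

/** Decorator example: wraps any UserIndex implementation and times each call. */
class BenchmarkingUserIndex implements UserIndex {
    private final UserIndex delegate;

    BenchmarkingUserIndex(UserIndex delegate) {
        this.delegate = delegate;
    }

    @Override
    public void indexNote(String noteGuid, String content) throws IOException {
        long start = System.nanoTime();
        delegate.indexNote(noteGuid, content);
        System.out.printf("indexNote took %d µs%n", (System.nanoTime() - start) / 1000);
    }

    @Override
    public List<String> search(String query) throws IOException {
        long start = System.nanoTime();
        List<String> hits = delegate.search(query);
        System.out.printf("search took %d µs%n", (System.nanoTime() - start) / 1000);
        return hits;
    }

    @Override
    public int indexVersion() {
        return delegate.indexVersion();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

Lucene29UserIndexImpl and Lucene4UserIndexImpl would implement the same interface against their respective Lucene versions, which is what allows them to coexist behind the factory.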
Hide Lucene behind ClassLoaders 
• One Maven artifact per major Lucene version, 
build profiles for code-reuse between minor updates 
• Code is packaged with dependencies into one common fat-jar with prefixes for each 
implementation: 
- lucene29/org/apache/lucene/... 
lucene29/com/evernote/search/lucene2/… 
- lucene43/org/apache/lucene/... 
lucene43/com/evernote/search/lucene4/… 
- lucene45/org/apache/lucene/… 
lucene45/com/evernote/search/lucene4/… 
• A ResourcePrefixClassLoader, called from outside code, strips the prefix and uses the fat jar as the only dependency (sketched below)
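The slides only name the ResourcePrefixClassLoader, so the following is a hedged reconstruction of the idea, not Evernote's code: a ClassLoader that redirects every class and resource lookup to "<prefix>" + path (e.g. "lucene45/org/apache/lucene/...") inside the fat jar.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/** Hypothetical sketch: resolves classes/resources under a per-version prefix. */
public class ResourcePrefixClassLoader extends ClassLoader {
    private final String prefix; // e.g. "lucene45/"

    public ResourcePrefixClassLoader(String prefix, ClassLoader parent) {
        super(parent);
        this.prefix = prefix;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // "org.apache.lucene.Foo" -> "lucene45/org/apache/lucene/Foo.class"
        String path = prefix + name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            byte[] bytes = out.toByteArray();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }

    @Override
    protected URL findResource(String name) {
        return getParent().getResource(prefix + name);
    }
}
```

Because the parent loader never sees the un-prefixed Lucene packages, each prefix gets its own isolated copy of org.apache.lucene, which is what lets several Lucene versions run in one VM.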
New Index Structure 
• Each user’s index now ships with a properties file that describes its internal structure, such as index type and version, so the code can handle per-index differences (see the sketch after this list) 
• Changes to the index schema? Just increase the index version 
and handle the rest in code 
• Automatically trigger re-indexing if necessary
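A minimal sketch of such a descriptor check, assuming a file named index.properties with keys index.type and index.version; both names are illustrative, not Evernote's actual format.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Sketch of a per-user index descriptor; file name and keys are assumptions. */
public class IndexDescriptor {
    private final String type;  // e.g. "lucene45"
    private final int version;  // schema version; bump it to force a re-index

    private IndexDescriptor(String type, int version) {
        this.type = type;
        this.version = version;
    }

    static IndexDescriptor load(Path indexDir) throws IOException {
        Properties p = new Properties();
        try (Reader r = Files.newBufferedReader(indexDir.resolve("index.properties"))) {
            p.load(r);
        }
        return new IndexDescriptor(
                p.getProperty("index.type", "lucene29"),
                Integer.parseInt(p.getProperty("index.version", "1")));
    }

    /** Re-index if the on-disk schema lags behind what the code expects. */
    boolean needsReindex(String targetType, int targetVersion) {
        return !type.equals(targetType) || version < targetVersion;
    }
}
```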
Index Auto-Migration 
• Target Default Index Implementation centrally set by DevOps 
• Triggered upon UserIndex access 
• UserIndex facade determines whether re-index is necessary 
• “Cruise Control” automates off-peak access (see the sketch below) 
[chart: “# Threads” used for auto-migration over time]
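One way the facade could trigger a rebuild on access while keeping the heavy work off-peak, reusing the IndexDescriptor sketch above. The thread count, time window, and class names are assumptions for illustration; the real “Cruise Control” adapts the amount of concurrent work dynamically.

```java
import java.time.LocalTime;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch of auto-migration triggered on UserIndex access. */
public class MigratingIndexFacade {
    // Bounded pool so migrations cannot starve regular search traffic.
    private final ExecutorService reindexPool = Executors.newFixedThreadPool(4);

    public void onUserIndexAccess(String userId, IndexDescriptor descriptor,
                                  String targetType, int targetVersion) {
        if (descriptor.needsReindex(targetType, targetVersion) && isOffPeak()) {
            reindexPool.submit(() -> rebuildIndex(userId, targetType, targetVersion));
        }
        // Searches keep being served from the old index until the rebuild swaps in.
    }

    private boolean isOffPeak() {
        LocalTime now = LocalTime.now();
        return now.isAfter(LocalTime.of(1, 0)) && now.isBefore(LocalTime.of(5, 0));
    }

    private void rebuildIndex(String userId, String targetType, int targetVersion) {
        // Re-read the user's notes from the primary store, write a fresh index
        // in the target format, then atomically switch the facade over. (Omitted.)
    }
}
```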
Phase 1: Migration to Lucene 4 
• Changes in Disk I/O (CPU usage correlates): 
- overall: -81% 
- searchRelatedNotes: -87% 
- keyword-based search: -96% 
Saves TBs of I/O
Phase 2: Add Compression 
• User index sizes and access patterns are skewed 
• Optimize large accounts 
• Directory-level compression 
• Compress segment files, invisible to the IndexReader 
• Only when re-indexing / every 3 months 
• In-memory Caching (see the LRU sketch below)
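A minimal sketch of a size-bounded LRU cache in the spirit of the in-memory caching above, assuming decompressed blocks keyed by file name plus block offset; the production cache is more elaborate (it is the 100 MB cache measured in the results slide).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Size-bounded LRU cache for decompressed blocks; sizes/keys are illustrative. */
public class BlockCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true turns the LinkedHashMap into an LRU: get() moves entries to the tail.
    private final LinkedHashMap<String, byte[]> blocks =
            new LinkedHashMap<>(1024, 0.75f, true);

    public BlockCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public synchronized byte[] get(String fileAndBlock) {
        return blocks.get(fileAndBlock);
    }

    public synchronized void put(String fileAndBlock, byte[] decompressed) {
        byte[] previous = blocks.put(fileAndBlock, decompressed);
        if (previous != null) {
            currentBytes -= previous.length;
        }
        currentBytes += decompressed.length;
        // Evict least-recently-used blocks until we are back under budget.
        Iterator<Map.Entry<String, byte[]>> it = blocks.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            if (eldest.getKey().equals(fileAndBlock)) {
                continue; // never evict the block we just inserted
            }
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}
```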
LuceneTransform 
• https://code.google.com/p/lucenetransform by Mitja Lenič 
• We ported it to Lucene 4.5 (now available upstream for 4.9) 
• Improved LRU caching, added LZ4/Snappy compression 
• We will contribute our changes soon
OverlayDirectory 
on disk:          visible to IndexReader: 
_23.cfe           _23.cfe 
_23.si            _23.si 
c$_23.cfs         _23.cfs 
segments.gen      segments.gen 
segments_2        segments_2
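A hedged sketch of the overlay idea written against Lucene's FilterDirectory (available in later 4.x releases). The c$ prefix convention is taken from the slide; the actual decompression happens inside the returned IndexInput, and the remaining Directory methods (fileLength, deleteFile, etc.) would need the same name mapping and are omitted here.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Sketch: hides the "c$" on-disk naming of compressed segment files from the IndexReader. */
public class OverlayDirectory extends FilterDirectory {
    private static final String COMPRESSED_PREFIX = "c$";

    public OverlayDirectory(Directory delegate) {
        super(delegate);
    }

    @Override
    public String[] listAll() throws IOException {
        // Present "c$_23.cfs" to the IndexReader as "_23.cfs".
        return Arrays.stream(in.listAll())
                .map(name -> name.startsWith(COMPRESSED_PREFIX)
                        ? name.substring(COMPRESSED_PREFIX.length())
                        : name)
                .toArray(String[]::new);
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        String compressed = COMPRESSED_PREFIX + name;
        if (Arrays.asList(in.listAll()).contains(compressed)) {
            // In the real implementation this input decompresses blocks on the fly,
            // consulting the in-memory block cache first.
            return in.openInput(compressed, context);
        }
        return in.openInput(name, context);
    }
}
```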
Results 
• Compressed the largest 5% of all indexes using LZ4 
• 1.9 TB index space saved 
• 100 MB LRU Cache hit rate: 79% on avg (67% — 93%) 
• Saved 0.5 PB disk reads/week 
• The cache works well enough that we may switch to a stronger (slower) compression algorithm 
and apply compression to more users 
Saves PBs of I/O
Bugs, Bugs, Bugs :-) 
• We’d been warned about known issues: 
- “VInt bug” 
- “background merge hit exception” 
- JVM segfaults 
• …and then these happened, too: 
- SPI / ContextClassLoaders … LUCENE-4713 
- Deadlocks / over-optimistic locking 
- Unclosed resources / too many open file handles => HousekeepingDirectory 
- Issues with the FieldCache singleton => LUCENE-831, LUCENE-2133, … 
- … 
• UserIndex tracks a “broken” state and allows self-healing (rebuild); see the sketch below
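A sketch of what such a broken-state guard can look like; the class and method names are assumptions, not Evernote's code. The key point is that notes remain the source of truth, so a corrupt per-user index can always be rebuilt.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative guard: any unexpected failure marks the index broken and schedules a rebuild. */
public abstract class SelfHealingUserIndex {
    private final AtomicBoolean broken = new AtomicBoolean(false);

    protected <T> T guarded(IndexCall<T> call) throws IOException {
        if (broken.get()) {
            throw new IOException("index marked broken; rebuild pending");
        }
        try {
            return call.run();
        } catch (IOException | RuntimeException e) {
            if (broken.compareAndSet(false, true)) {
                scheduleRebuild(); // notes are the source of truth, so a rebuild loses nothing
            }
            throw e;
        }
    }

    protected abstract void scheduleRebuild();

    @FunctionalInterface
    protected interface IndexCall<T> {
        T run() throws IOException;
    }
}
```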
Conclusion 
• Design-by-Contract, Separation of Concerns 
• Per-user Search Implementation / Multiple Lucene versions 
• Migrated 60M users, without noticeable downtime 
• Migration allowed index changes, saves TBs of disk I/O 
• Block-level Index Compression, saves PBs of disk I/O 
• This is just the beginning.
Thank you 
christian@evernote.com
We’re hiring 
evernote.com/careers