1
June 2014
Box + Solr = Content Search for Business
2
Wei Zhao
Box backend engineer
wzhao@box.com
3
to make organizations more productive,
competitive and collaborative by connecting
people and their most important infor...
4
25MM+
Users
225K+
Businesses
99%
Fortune 500
5
Box search mission is to make user content
easy to discover.
6
10Billion+
Documents
10TB+
Index size
100M+
Daily requests
Box uses Solr for search
7
Quick Search
8
Quick Search
9
Full Search
10
Agenda
1 Sharding – splitting the index
2 Highly available search
3 A few more things
4 Currently working on
5 Q&A
11
We shard things
12
Shard ID = File ID % Total Shards
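The formula on this slide can be sketched in a few lines of Python. The function name and the shard count of 16 are illustrative assumptions; the slides do not state how many shards Box runs.

```python
def shard_for_file(file_id: int, total_shards: int) -> int:
    """Return the shard that holds a file: Shard ID = File ID % Total Shards."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# File ID 12345 comes from the example document in the slides;
# 16 shards is a made-up number for illustration.
print(shard_for_file(12345, 16))  # prints 9
```

One property of plain modulo sharding worth keeping in mind: changing the total number of shards changes the result for most file IDs, so growing the cluster generally means moving or reindexing documents.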
13
Multi-tenant – One big logical index for all users
Solr index
Shard1 Shard2 Shard3 ShardN
14
Search scope
15
File ID: 12345
OwnerID: user1
Parent Folder IDs: folder1, folder2
File Name: Solr.ppt
File Content: blah
......
A typi...
16
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder2
Owner: User1
Parent:
Folder1
Fol...
17
User1 with no shared folder
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder2
Owner...
18
User2 shares Folder2 with User1
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder2
...
19
User2 shares Folder2 with User1
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder2
...
20
User2 shares Folder2 with User1
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder5
...
21
User2 shares Folder2 with User1
Owner: User1
Parent: Folder1
Owner: User2
Parent: Folder3
Owner: User2
Parent: Folder5
...
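The search-scope slides can be read as: a user sees files they own plus files whose parent folders have been shared with them. The sketch below is a minimal model of that reading, not Box's implementation. The field names `owner_id` and `parent_folder_ids`, the helper names, and the pairing of owners and parents with File 1 to File 4 (taken in the order they appear in the transcript) are assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass
class FileDoc:
    file_id: int
    owner_id: str
    parent_folder_ids: List[str]


def build_scope_filter(user_id: str, shared_folder_ids: Set[str]) -> str:
    """Build a Solr-style filter string: own files OR files in shared folders."""
    clauses = [f"owner_id:{user_id}"]
    if shared_folder_ids:
        folders = " OR ".join(sorted(shared_folder_ids))
        clauses.append(f"parent_folder_ids:({folders})")
    return " OR ".join(clauses)


def in_scope(doc: FileDoc, user_id: str, shared_folder_ids: Set[str]) -> bool:
    """Apply the same rule in memory, to show which files the filter would match."""
    return doc.owner_id == user_id or bool(set(doc.parent_folder_ids) & shared_folder_ids)


def visible_ids(docs: List[FileDoc], user_id: str, shared: Set[str]) -> List[int]:
    return [d.file_id for d in docs if in_scope(d, user_id, shared)]


# The four files from the slides (pairing with file numbers is assumed).
files = [
    FileDoc(1, "User1", ["Folder1"]),
    FileDoc(2, "User2", ["Folder3"]),
    FileDoc(3, "User2", ["Folder2"]),
    FileDoc(4, "User1", ["Folder1", "Folder4"]),
]

print(visible_ids(files, "User1", set()))        # no shared folder
print(visible_ids(files, "User1", {"Folder2"}))  # User2 shares Folder2
print(build_scope_filter("User1", {"Folder2"}))

# File 3 is moved out of Folder2 into Folder5, so it leaves User1's scope.
moved = [replace(d, parent_folder_ids=["Folder5"]) if d.file_id == 3 else d for d in files]
print(visible_ids(moved, "User1", {"Folder2"}))
```

Storing parent folder IDs on each document means a share only changes the query filter, while a move requires updating the moved file's document.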
22
Highly Available Search
23
• Index is highly available
• Search functionality is highly available
24
Index workflow
25
Diagram labels: Box Front End; Upload; Index Queue; Queue 1, Queue 2, Queue 3; Indexer 1, Indexer 2, Indexer 3; MySQL; Index1, Index2, Index2
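The diagram lists an index queue, three queues, three indexers and three indexes, but the slide text does not say whether the three queues feed replicas of the same index or different partitions. The sketch below assumes replicas, which would fit the "index is highly available" goal: every upload is fanned out to each queue, and each indexer applies updates to its own copy. All class and function names are hypothetical, and the MySQL box from the diagram is left out.

```python
import queue
from typing import Dict, List


class ReplicaIndexer:
    """One queue plus one indexer plus one index copy (assumed to be a replica)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pending: "queue.Queue[dict]" = queue.Queue()
        self.index: Dict[int, dict] = {}  # file_id -> indexed document

    def drain(self) -> int:
        """Apply every queued update to this replica; return how many were applied."""
        applied = 0
        while True:
            try:
                doc = self.pending.get_nowait()
            except queue.Empty:
                return applied
            self.index[doc["file_id"]] = doc
            applied += 1


def enqueue_upload(doc: dict, indexers: List[ReplicaIndexer]) -> None:
    """Fan one upload event out to every replica's queue."""
    for indexer in indexers:
        indexer.pending.put(dict(doc))  # copy so replicas do not share state


indexers = [ReplicaIndexer("Index1"), ReplicaIndexer("Index2"), ReplicaIndexer("Index3")]
enqueue_upload({"file_id": 12345, "file_name": "Solr.ppt"}, indexers)

# Suppose the third indexer is temporarily down: the other two still index,
# and the update waits in the third queue until that indexer comes back.
indexers[0].drain()
indexers[1].drain()
print([len(ix.index) for ix in indexers])
indexers[2].drain()
print([len(ix.index) for ix in indexers])
```

Under this assumption the queues decouple uploads from indexing, so a slow or failed replica delays only its own copy.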
26
Search workflow
27
Diagram labels: Box Front End; query; HA Proxy; Head node; HA Proxy; 1, 2, 3, N
Diagram labels: Box Front End; query; HA Proxy; Head node; HA Proxy; 1, 2, 3, N
Data ce...
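One way to picture the head node's job is Solr's distributed search: a request carries a `shards` parameter listing the shard endpoints, and the per-shard hits are merged by score. The sketch below only builds such a request and shows the merge step in plain Python. The host names and helper names are made up, and the slides do not say whether Box's head node uses the `shards` parameter in this way.

```python
import heapq
from typing import Dict, Iterable, List
from urllib.parse import urlencode


def distributed_query_params(query: str, shard_urls: List[str], rows: int = 10) -> str:
    """Encode a Solr query that fans out to the given shards."""
    return urlencode({"q": query, "rows": rows, "shards": ",".join(shard_urls)})


def merge_shard_results(per_shard_hits: Iterable[List[Dict]], rows: int = 10) -> List[Dict]:
    """Keep the top `rows` hits across all shards, highest score first."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(rows, all_hits, key=lambda hit: hit["score"])


# Hypothetical shard endpoints sitting behind the second HA Proxy tier.
shards = ["shard1.example:8983/solr/files", "shard2.example:8983/solr/files"]
print(distributed_query_params("file_name:Solr.ppt", shards))

# Toy per-shard responses to illustrate the merge.
shard_a = [{"id": 1, "score": 2.5}, {"id": 7, "score": 0.4}]
shard_b = [{"id": 4, "score": 1.9}]
print(merge_shard_results([shard_a, shard_b], rows=2))
```

In a real deployment Solr performs the merge itself when `shards` is supplied; the merge function here is only meant to show what the head node role amounts to.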
28
A few more things
29
File Content Search
30
Diagram labels: Box Front End; Upload; MySQL; Box File Storage; Indexer; Solr Index; Text Extraction; Extracted Text
31
Multi-language support
32
Diagram labels: Raw file content; Language detector; English tokenizer, Spanish tokenizer, Japanese tokenizer, German tokenizer
file_content...
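The routing idea on this slide (detect the language, then index the text into a per-language field with its own tokenizer) can be sketched as follows. The detector here is a toy stand-in based on a few stopwords and a Unicode range check; a production system would use a real language-detection library. The field naming `file_content_<lang>` follows the slide, while the function names and stopword lists are assumptions.

```python
from typing import Dict

# Tiny illustrative stopword lists; not a real detector.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "es": {"el", "la", "hola", "y"},
    "de": {"der", "die", "und", "ist"},
}


def detect_language(text: str) -> str:
    """Guess one of en/es/ja/de; fall back to English when nothing matches."""
    # Hiragana and katakana occupy U+3040 to U+30FF.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"
    words = set(text.lower().split())
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en"


def route_content(text: str) -> Dict[str, str]:
    """Place the text in the field whose analyzer matches its language."""
    return {f"file_content_{detect_language(text)}": text}


print(route_content("hola y adios"))
print(route_content("der Hund ist hier"))
print(route_content("こんにちは"))
```

Because each document lands in exactly one language field, this sketch shares the limitation the next slide lists as a to-do: a document with mixed languages is indexed under only one of them.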
33
To Dos
• Scale language support
• Support documents with mixed languages
34
Search Warm-up
35
• Front end informs backend to warm up on keyboard focus
• Backend prepares the search filter and caches it in a search...
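The warm-up flow (per the transcript: the front end signals keyboard focus, the backend prepares and caches the search filter in a search session, then sends a warm-up query to Solr) might look roughly like the sketch below. The class, the callbacks, the filter field names and the `*:*` warm-up query with `rows=0` are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


class SearchSession:
    """Caches each user's search filter so the first real query can reuse it."""

    def __init__(
        self,
        load_shared_folders: Callable[[str], Iterable[str]],
        send_to_solr: Callable[[Dict[str, object]], object],
    ) -> None:
        self._load_shared_folders = load_shared_folders  # potentially slow lookup
        self._send_to_solr = send_to_solr
        self._filters: Dict[str, str] = {}

    def _build_filter(self, user_id: str) -> str:
        folders = sorted(self._load_shared_folders(user_id))
        fq = f"owner_id:{user_id}"
        if folders:
            fq += " OR parent_folder_ids:(" + " OR ".join(folders) + ")"
        return fq

    def warm_up(self, user_id: str) -> None:
        """Called when the search box gains keyboard focus."""
        fq = self._build_filter(user_id)
        self._filters[user_id] = fq
        # Warm-up query: matches everything, returns no rows, exercises the filter.
        self._send_to_solr({"q": "*:*", "fq": fq, "rows": 0})

    def search(self, user_id: str, text: str) -> object:
        fq = self._filters.get(user_id)
        if fq is None:  # no warm-up happened; build the filter now
            fq = self._build_filter(user_id)
            self._filters[user_id] = fq
        return self._send_to_solr({"q": text, "fq": fq, "rows": 10})


# Stand-ins for the folder lookup and the Solr client.
lookups: List[str] = []
sent: List[Dict[str, object]] = []


def fake_folders(user_id: str) -> List[str]:
    lookups.append(user_id)
    return ["Folder2"]


session = SearchSession(fake_folders, lambda params: sent.append(params))
session.warm_up("User1")          # keyboard focus
session.search("User1", "solr")   # user finishes typing
print(len(lookups), [p["q"] for p in sent])
```

Solr caches the results of filter queries in its filter cache, so a warm-up request that uses the same `fq` as the later search plausibly lets the real query skip that work; whether Box relies on this exact mechanism is not stated in the slides.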
36
What we are working on
37
• Search suggestions
• Search operators
• Use machine learning to influence ranking
• Logical sharding
Things we are wo...
38
Questions?
39
Contact: wzhao@box.com
We are hiring!

SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business


"Box + Solr = Content Search for Business" - Wei Zhao, Box

  • This has changed many of the access patterns to data in the enterprise.
  • SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business

    1. June 2014. Box + Solr = Content Search for Business
    2. Wei Zhao, Box backend engineer, wzhao@box.com
    3. Box mission: to make organizations more productive, competitive and collaborative by connecting people and their most important information
    4. 25MM+ Users; 225K+ Businesses; 99% Fortune 500
    5. Box search mission is to make user content easy to discover.
    6. 10 billion+ documents; 10TB+ index size; 100M+ daily requests. Box uses Solr for search
    7. Quick Search
    8. Quick Search
    9. Full Search
    10. Agenda: (1) Sharding – splitting the index; (2) Highly available search; (3) A few more things; (4) Currently working on; (5) Q&A
    11. We shard things
    12. Shard ID = File ID % Total Shards
    13. Multi-tenant – One big logical index for all users. Solr index: Shard1, Shard2, Shard3, ShardN
    14. Search scope
    15. A typical Solr Document: File ID: 12345; OwnerID: user1; Parent Folder IDs: folder1, folder2; File Name: Solr.ppt; File Content: blah; ......
    16. File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder2); File 4 (Owner: User1; Parent: Folder1, Folder4)
    17. User1 with no shared folder: File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder2); File 4 (Owner: User1; Parent: Folder1, Folder4)
    18. User2 shares Folder2 with User1: File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder2); File 4 (Owner: User1; Parent: Folder1, Folder4)
    19. User2 shares Folder2 with User1: File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder2); File 4 (Owner: User1; Parent: Folder1, Folder4)
    20. User2 shares Folder2 with User1: File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder5); File 4 (Owner: User1; Parent: Folder1, Folder4). Removed out of Folder2
    21. User2 shares Folder2 with User1: File 1 (Owner: User1; Parent: Folder1); File 2 (Owner: User2; Parent: Folder3); File 3 (Owner: User2; Parent: Folder5); File 4 (Owner: User1; Parent: Folder1, Folder4). Removed out of Folder2
    22. Highly Available Search
    23. Index is highly available; search functionality is highly available
    24. Index workflow
    25. Diagram labels: Box Front End; Upload; Index Queue; Queue 1, Queue 2, Queue 3; Indexer 1, Indexer 2, Indexer 3; MySQL; Index1, Index2, Index2
    26. Search workflow
    27. Diagram labels: Box Front End; query; HA Proxy; Head node; HA Proxy; 1, 2, 3, N. Box Front End; query; HA Proxy; Head node; HA Proxy; 1, 2, 3, N. Data center boundary
    28. A few more things
    29. File Content Search
    30. Diagram labels: Box Front End; Upload; MySQL; Box File Storage; Indexer; Solr Index; Text Extraction; Extracted Text
    31. Multi-language support
    32. Diagram labels: Raw file content; Language detector; English tokenizer, Spanish tokenizer, Japanese tokenizer, German tokenizer; file_content_en, File_content_es {hola}, file_content_ja, . . . ., File_content_de
    33. To Dos: scale language support; support documents with mixed languages
    34. Search Warm-up
    35. Front end informs backend to warm up on keyboard focus; backend prepares the search filter and caches it in a search session; backend sends a warm-up query to Solr
    36. What we are working on
    37. Things we are working on: search suggestions; search operators; use machine learning to influence ranking; logical sharding
    38. Questions?
    39. Contact: wzhao@box.com. We are hiring!