SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business

"Box + Solr = Content Search for Business" - Wei Zhao, Box

Published in: Technology

  1. Box + Solr = Content Search for Business (June 2014)
  2. Wei Zhao, backend engineer, Box (wzhao@box.com)
  3. Box mission: to make organizations more productive, competitive and collaborative by connecting people and their most important information
  4. 25MM+ users, 225K+ businesses, 99% of the Fortune 500
  5. The Box search mission is to make user content easy to discover.
  6. Box uses Solr for search: 10 billion+ documents, 10 TB+ index size, 100M+ daily requests
  7. Quick Search
  8. Quick Search
  9. Full Search
  10. Agenda: (1) Sharding: splitting the index; (2) Highly available search; (3) A few more things; (4) Currently working on; (5) Q&A
  11. We shard things
  12. Shard ID = File ID % Total Shards
  13. Multi-tenant: one big logical Solr index for all users, split into Shard1, Shard2, Shard3, ..., ShardN
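The routing rule on slide 12 can be sketched in a few lines of Python. This is only an illustration of the modulo formula: the function name `shard_id` and the shard count of 8 are assumptions, since the talk does not state how many shards Box runs.

```python
def shard_id(file_id: int, total_shards: int) -> int:
    """Return the shard for a file using the slide's rule: File ID % Total Shards."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# Hypothetical shard count; the real number is not given in the slides.
TOTAL_SHARDS = 8

if __name__ == "__main__":
    for file_id in (12345, 12346, 12353):
        print(file_id, "->", shard_id(file_id, TOTAL_SHARDS))
```

One general property of modulo routing is worth keeping in mind: the result depends on the total shard count, so changing that count sends most file IDs to a different shard.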
  14. Search scope
  15. A typical Solr document: File ID: 12345; Owner ID: user1; Parent Folder IDs: folder1, folder2; File Name: Solr.ppt; File Content: blah ......
  16. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder2; Owner: User1, Parent: Folder1, Folder4
  17. User1 with no shared folders. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder2; Owner: User1, Parent: Folder1, Folder4
  18. User2 shares Folder2 with User1. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder2; Owner: User1, Parent: Folder1, Folder4
  19. User2 shares Folder2 with User1. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder2; Owner: User1, Parent: Folder1, Folder4
  20. User2 shares Folder2 with User1; a file is removed from Folder2. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder5; Owner: User1, Parent: Folder1, Folder4
  21. User2 shares Folder2 with User1; a file is removed from Folder2. Diagram: File 1 to File 4, labeled in order: Owner: User1, Parent: Folder1; Owner: User2, Parent: Folder3; Owner: User2, Parent: Folder5; Owner: User1, Parent: Folder1, Folder4
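The search-scope slides (14 to 21) suggest that a user can find files they own plus files that sit under a folder shared with them. The sketch below shows one way such a scope could be expressed as a Solr filter string, with an in-memory check that mirrors it. The field names `owner_id` and `parent_folder_ids`, both function names, and the pairing of each label with File 1 to File 4 are assumptions for illustration; the slides do not show Box's actual schema or query.

```python
from typing import Iterable


def build_scope_filter(user_id: str, shared_folder_ids: Iterable[str]) -> str:
    """Build a Solr-style filter: files the user owns OR files under shared folders."""
    clauses = [f'owner_id:"{user_id}"']
    folders = sorted(set(shared_folder_ids))
    if folders:
        joined = " OR ".join(f'"{folder}"' for folder in folders)
        clauses.append(f"parent_folder_ids:({joined})")
    return " OR ".join(clauses)


def in_scope(doc: dict, user_id: str, shared_folder_ids: Iterable[str]) -> bool:
    """In-memory equivalent of the filter above, for illustration only."""
    shared = set(shared_folder_ids)
    return doc["owner_id"] == user_id or bool(shared & set(doc["parent_folder_ids"]))


# Illustrative documents modeled on the slide labels (the pairing is assumed).
docs = [
    {"id": "file1", "owner_id": "user1", "parent_folder_ids": ["folder1"]},
    {"id": "file2", "owner_id": "user2", "parent_folder_ids": ["folder3"]},
    {"id": "file3", "owner_id": "user2", "parent_folder_ids": ["folder2"]},
    {"id": "file4", "owner_id": "user1", "parent_folder_ids": ["folder1", "folder4"]},
]

if __name__ == "__main__":
    # User1 with no shared folders: only owned files match.
    print(build_scope_filter("user1", []))
    # After User2 shares folder2 with User1, files under folder2 match as well.
    print(build_scope_filter("user1", ["folder2"]))
    print([d["id"] for d in docs if in_scope(d, "user1", ["folder2"])])
```

Under this sketch, moving a file out of the shared folder only requires updating that document's parent folder list; the same filter then stops matching it.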
  22. Highly Available Search
  23. Two goals:
      • Index is highly available
      • Search functionality is highly available
  24. Index workflow
  25. Diagram components: Box Front End, Upload, MySQL, Index Queue (Queue 1, Queue 2, Queue 3), Indexer 1, Indexer 2, Indexer 3, Index1, Index2, Index2
  26. Search workflow
  27. Diagram, shown on each side of a data center boundary: Box Front End, query, HA Proxy, Head node, HA Proxy, 1, 2, 3, ..., N
  28. A few more things
  29. File Content Search
  30. Diagram components: Box Front End, Upload, MySQL, Box File Storage, Text Extraction, Extracted Text, Indexer, Solr Index
  31. Multi-language support
  32. Diagram: raw file content, language detector, per-language tokenizers (English, Spanish, Japanese, German, ...), per-language fields (file_content_en, file_content_es, file_content_ja, ..., file_content_de); example: {hola} lands in file_content_es
  33. To dos:
      • Scale language support
      • Support documents with mixed languages
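The multi-language slides (31 to 33) describe detecting a file's language and storing its text in a language-specific field such as file_content_es. A minimal sketch of that routing step follows. The detector here is a toy keyword lookup standing in for a real language detector, and the function names, the `SUPPORTED` set, and the fallback to English are all assumptions rather than anything stated in the talk.

```python
from typing import Callable, Dict

# Language codes taken from the field names on slide 32.
SUPPORTED = {"en", "es", "ja", "de"}


def toy_detect_language(text: str) -> str:
    """Stand-in detector: keyword lookup only, not a real language model."""
    hints = {"hola": "es", "hallo": "de", "hello": "en", "こんにちは": "ja"}
    lowered = text.lower()
    for word, lang in hints.items():
        if word in lowered:
            return lang
    return "en"


def route_content(
    text: str,
    detect: Callable[[str], str] = toy_detect_language,
    default_lang: str = "en",  # assumed fallback; the slides do not specify one
) -> Dict[str, str]:
    """Return the Solr field/value pair for the detected language."""
    lang = detect(text)
    if lang not in SUPPORTED:
        lang = default_lang
    return {f"file_content_{lang}": text}


if __name__ == "__main__":
    print(route_content("hola"))
    print(route_content("Hallo Welt"))
```

Because each field holds a single language, a document that mixes languages does not fit this sketch cleanly, which matches the second to-do on slide 33.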
  34. Search Warm-up
  35. Warm-up steps:
      • Front end informs backend to warm up on keyboard focus
      • Backend prepares the search filter and caches it in a search session
      • Backend sends a warm-up query to Solr
  36. What we are working on
  37. Things we are working on:
      • Search suggestions
      • Search operators
      • Use machine learning to influence ranking
      • Logical sharding
  38. Questions?
  39. Contact: wzhao@box.com. We are hiring!
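The three warm-up steps on slide 35 can be sketched as a small session cache plus two handlers. Everything named here is hypothetical: `SearchSessionCache`, `on_keyboard_focus`, `on_search`, the 300-second lifetime, and the choice of a match-all query with `rows=0` as the warm-up query are assumptions, since the slides do not say how Box implements any of these. The `q`, `fq`, and `rows` keys are standard Solr query parameters.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SearchSessionCache:
    """Tiny TTL cache mapping a user to a prepared search filter."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, user_id: str, search_filter: str) -> None:
        self._entries[user_id] = (self._clock() + self._ttl, search_filter)

    def get(self, user_id: str) -> Optional[str]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        expires_at, search_filter = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # drop the expired session entry
            return None
        return search_filter


def on_keyboard_focus(user_id, cache, prepare_filter, send_query) -> None:
    """Warm-up: prepare and cache the filter, then send a cheap query to Solr."""
    search_filter = prepare_filter(user_id)
    cache.put(user_id, search_filter)
    # Assumed warm-up query: match everything in scope, return no rows.
    send_query({"q": "*:*", "fq": search_filter, "rows": 0})


def on_search(user_id, terms, cache, prepare_filter, send_query):
    """Real search: reuse the cached filter when the session still has it."""
    search_filter = cache.get(user_id) or prepare_filter(user_id)
    return send_query({"q": terms, "fq": search_filter})


# Stub collaborators so the sketch runs without a Solr server.
sent = []
cache = SearchSessionCache()
prepare = lambda uid: f'owner_id:"{uid}"'
send = lambda params: sent.append(params)

on_keyboard_focus("user1", cache, prepare, send)
on_search("user1", "solr", cache, prepare, send)
print(sent)
```

The point of the design is that the expensive part, computing the user's search filter, happens while the user is still moving to the search box, so the first real query can reuse it.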
