Solr Power FTW    Alex Pinkin     @apinkin    #solrnosql
What Will I Cover?● Who I am● What Bazaarvoice does● SOLR and NoSQL● Can SOLR handle 20K queries per second?● Lessons lear...
Alex Pinkin● Software Engineering Lead,  Data Infrastructure team,  Bazaarvoice● Loves to play with SQL and NoSQL         ...
Bazaarvoice● Bazaarvoice is a software as a service  company powering user generated content  such as ratings and reviews ...
NoSQL ?
SQL vs NoSQL
NoSQL is Not Only SQL● Departs from relational model● No fixed schema● No joins● Eventual consistency is OK● Scale horizon...
Types of NoSQL● Key-value (Redis, Riak, Voldemort)● Document (MongoDB, CouchDB)● Graph (Neo4J, FlockDB)● Column family (Ca...
SOLR as NoSQL● Non-relational model - Check● No fixed schema - Check (dynamic fields)● No joins - Check (denormalization)●...
SOLR stats - Bazaarvoice     Documents         250 MM     Index size        200 GB     QPS, avg          2,350     QPS, ma...
SOLR Case Study
SOLR Case Study
Life Before SOLR● Indexes for sorting and filtering● Aggregate tables for stats● Nightly jobs● Bugs...
Enter SOLR● Index content and product catalog● De-normalization● Filtering and sorting● Index every 15 minutes (20 seconds...
SOLR - Statistics● COUNT, SUM, AVG, MIN, MAX  (StatsComponent)● Stored fields● Whenever content changes,  re-calc stats fo...
Scaling reads - Replication
Replication - Multiple Data Centers
Replication - Multiple Data CentersChatty if using multiple coresRelay ● Core auto-warming disabled ● Connection wait and ...
SOLR Cloud - Bazaarvoice version● Multiple cores (100+ per server)● Re-balance indexes across cores and servers   ○ Automa...
Schema ChangesRe-indexing is time consuming for large indexesProcess    1. Full re-index off-line    2. Incremental indexi...
Performance Tuning ● Heap size ● Cache sizing ● Auto-warming ● Stored fields ● Merge factor ● Commit frequency ● Optimize ...
Performance Tuning - GC# Java memory usage settings# Force the NewSize to be larger than the JVM typically allocates.# In ...
SOLR Performance - Summary● SOLR loves RAM!● Log replay                         SOLR● Same config, same hardware● Get the ...
Conclusion - SOLR Strengths● Lightning fast given enough RAM● Good scale out support  including multi-data center● Great c...
Conclusion - SOLRs Gaps● Not fully elastic● Real time takes work● Secondary data store = sync overhead● Schema changes
Questions            @apinkin
Upcoming SlideShare
Loading in …5
×

Solr Power FTW: Powering NoSQL the World Over

992 views

Published on

Solr is an open source, Lucene based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
992
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
16
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Solr Power FTW: Powering NoSQL the World Over

  1. 1. Solr Power FTW Alex Pinkin @apinkin #solrnosql
  2. 2. What Will I Cover?● Who I am● What Bazaarvoice does● SOLR and NoSQL● Can SOLR handle 20K queries per second?● Lessons learned: large scale multi data center deployment● Conclusion
  3. 3. Alex Pinkin● Software Engineering Lead, Data Infrastructure team, Bazaarvoice● Loves to play with SQL and NoSQL @apinkin
  4. 4. Bazaarvoice● Bazaarvoice is a software as a service company powering user generated content such as ratings and reviews on thousands of web sites● 5 billion page views per month● 230 billion impressions● 75 million UGC
  5. 5. NoSQL ?
  6. 6. SQL vs NoSQL
  7. 7. NoSQL is Not Only SQL● Departs from relational model● No fixed schema● No joins● Eventual consistency is OK● Scale horizontally
  8. 8. Types of NoSQL● Key-value (Redis, Riak, Voldemort)● Document (MongoDB, CouchDB)● Graph (Neo4J, FlockDB)● Column family (Cassandra, HBase)
  9. 9. SOLR as NoSQL● Non-relational model - Check● No fixed schema - Check (dynamic fields)● No joins - Check (denormalization)● Horizontal scaling - Check (with work)
  10. 10. SOLR stats - Bazaarvoice Documents 250 MM Index size 200 GB QPS, avg 2,350 QPS, max 10,200 Response time, avg 12 ms Servers 6+20
  11. 11. SOLR Case Study
  12. 12. SOLR Case Study
  13. 13. Life Before SOLR● Indexes for sorting and filtering● Aggregate tables for stats● Nightly jobs● Bugs...
  14. 14. Enter SOLR● Index content and product catalog● De-normalization● Filtering and sorting● Index every 15 minutes (20 seconds NRT)
  15. 15. SOLR - Statistics● COUNT, SUM, AVG, MIN, MAX (StatsComponent)● Stored fields● Whenever content changes, re-calc stats for all affected subjects
  16. 16. Scaling reads - Replication
  17. 17. Replication - Multiple Data Centers
  18. 18. Replication - Multiple Data CentersChatty if using multiple coresRelay ● Core auto-warming disabled ● Connection wait and read timeouts increased ● Replication poll interval increased (15 min) ● Compression enabled...<str name="httpConnTimeout">20000</str><str name="httpReadTimeout">65000</str><str name="pollInterval">00:15:00</str><str name="compression">internal</str>...
  19. 19. SOLR Cloud - Bazaarvoice version● Multiple cores (100+ per server)● Re-balance indexes across cores and servers ○ Automatic ○ Manual● Deployment map stored in MySQL ○ Host - Core - Partition ○ Statistics● Partition lifecycle
  20. 20. Schema ChangesRe-indexing is time consuming for large indexesProcess 1. Full re-index off-line 2. Incremental indexing prior to release 3. Incremental indexing after the releaseBottleneck: reading from MySQLGoal: Transparent re-indexing
  21. 21. Performance Tuning ● Heap size ● Cache sizing ● Auto-warming ● Stored fields ● Merge factor ● Commit frequency ● Optimize frequencyProcess: Simulate and measure ● Replay logs ● Analyze metrics ● Monitor GC
  22. 22. Performance Tuning - GC# Java memory usage settings# Force the NewSize to be larger than the JVM typically allocates.# In practice, the JVM has been allocating an extremely small Young generationwhich objects to be prematurely promoted to the Tenured generationJAVA_MEM_OPTS="-Xms27g -Xmx27g -XX:NewRatio=8"# -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps --> Turn on GC Logging# -XX:+UseConcMarkSweepGC --> Use the concurrent collector# -XX:+CMSIncrementalMode --> Incremental mode for the concurrent collector# -XX:+CMSIncrementalPacing --> Let the JVM adjust the amount of incrementalcollectionJAVA_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=55 -XX:ParallelGCThreads=8 -XX:SurvivorRatio=4"
  23. 23. SOLR Performance - Summary● SOLR loves RAM!● Log replay SOLR● Same config, same hardware● Get the most out of one instance
  24. 24. Conclusion - SOLR Strengths● Lightning fast given enough RAM● Good scale out support including multi-data center● Great community
  25. 25. Conclusion - SOLRs Gaps● Not fully elastic● Real time takes work● Secondary data store = sync overhead● Schema changes
  26. 26. Questions @apinkin

×