Solr for Analytics

Aggregating Metrics with Solr at Sematext

Published in: Technology, Education

Solr for Analytics

  1. 1. Hi :)
  2. 2. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." - Dan Ariely, Duke University
  3. 3. Solr for Analytics Metrics Aggregations at Sematext Otis Gospodnetić @otisg
  4. 4. OG - Pro … since circa 2000 … since 2007
  5. 5. Needle Addiction
  6. 6. Watermelon Addiction
  7. 7. Bread Love
  8. 8. Nuts!
  9. 9. Addiction Manifestation #1 Gathering Search Analytics ✓ ✓ ✓ Search Analytics
  10. 10. Addiction Manifestation #2 Gathering Search Analytics ✓ ✓ ✓ SPM
  11. 11. Addiction Manifestation #3 Gathering Search Analytics ✓ ✓ ✓ Logsene
  12. 12. The Surface
  13. 13. Under the Hood
  14. 14. ?
  15. 15. Under the Hood
  16. 16. Stripped <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  17. 17. Schema
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
      <field name="baseRowKey" type="binary" indexed="false" stored="true" required="true" multiValued="false"/>
      <field name="dimListId" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
      <field name="metricsData" type="binary" indexed="false" stored="true" required="true" multiValued="false"/>
      <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
      <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true" />
      <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  18. 18. Append on Write; Compact Later?
  19. 19. Compaction… maybe?
  20. 20. github.com / sematext / HBaseHUT Ideally
  21. 21. Merge on Read
  22. 22. Merge on Read
  23. 23. Metric Retrieval params={shard.keys=2349_201!&collection=partialPreaggCollection&wt=javabin&qt=select&start=0&q=dimListId:201&fq=dim_num_6:[1383570000000+TO+1383591600000]&fq=dim_str_0:MjM0OQ%3D%3D&fq=dim_str_18:(MDgwYTllNzgtNTZkMy00ODhlLWE2NGUtZmM2ODE5NGNkNGNk)&version=2&rows=99999} hits=74 status=0 QTime=41
  24. 24. Write Friendly Present: SOLR-4816 - CloudSolrServer; SOLR-5232 - Streaming updates between nodes; SOLR-3249 - Javabin between nodes; SOLR-???? - Aggregating URP
  25. 25. 4.4 vs. 4.5: 4.4 (no routing) vs. 4.5 (routing via CloudSolrServer); faster AFS flushes; faster indexing: 3.5 M vs. 4 M docs in 55 min; lower heap: circa 10%
  26. 26. DocValues
      <field name="id" type="string" indexed="true" stored="true" required="true" docValues="true" />
      <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
      <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true" />
      Forward index data structure: in memory and/or on disk; compressible; off-heap
      See https://cwiki.apache.org/confluence/display/solr/DocValues and http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
  27. 27. !DocValues vs. DocValues. With DocValues: smaller heap; more numerous, but smaller GCs; roughly equal query latency; faster faceting, sorting, and grouping (but we are not using that in SPM with Solr); didn't see a slowdown in indexing (hmmm)
  28. 28. !DocValues vs. DocValues
  29. 29. Analytics Friendly Future: LUCENE-5189 - Numeric DocValues Updates; SOLR-2894 - Distributed Pivot Facets; SOLR-5302 - Analytics Component; SOLR-4465 - Configurable Collectors; SOLR-5045 - Pluggable Analytics; SOLR-5027 - Field Collapsing PostFilter
  30. 30. FIN. Questions? sematext.com/about/jobs 20% off: LR2013SPM20 /spm /logsene /search-analytics otis@sematext.com @otisg @sematext
  31. 31. Filter Collection
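
Slides 18 to 22 settle on "Merge on Read" rather than compaction, and the request on slide 23 reads from a collection named partialPreaggCollection with a large rows value. One way to picture that pattern is the sketch below. It is only an illustration: the deck does not show the layout of the binary metricsData field, so the PartialAgg record (count, total, minimum, maximum) and the merge_on_read helper are hypothetical names invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialAgg:
    """Hypothetical partial aggregate; the real metricsData format is not shown in the deck."""
    count: int
    total: float
    minimum: float
    maximum: float

    def merge(self, other: "PartialAgg") -> "PartialAgg":
        # Counts and sums add up; min and max combine with min() and max().
        return PartialAgg(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def merge_on_read(rows):
    """Fold (key, PartialAgg) rows, in the order they are returned, into one aggregate per key."""
    merged = {}
    for key, partial in rows:
        merged[key] = merged[key].merge(partial) if key in merged else partial
    return merged


if __name__ == "__main__":
    # Made-up rows standing in for partial documents written by separate flushes.
    rows = [
        ("filterCache", PartialAgg(2, 10.0, 3.0, 7.0)),
        ("queryResultCache", PartialAgg(1, 4.0, 4.0, 4.0)),
        ("filterCache", PartialAgg(3, 6.0, 1.0, 3.0)),
    ]
    for key, agg in merge_on_read(rows).items():
        print(key, agg)
```

Because every field of the partial record merges associatively, writers can append partial documents without reading first, and the reader pays the merge cost only for the rows a query actually returns.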
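
The request logged on slide 23 is easier to read once its filter values are unpacked. The sketch below parses the parameter string copied from that slide (extraction whitespace removed) and decodes each fq clause. Two things are assumptions rather than facts from the deck: that dim_str_* values are Base64-encoded UTF-8 strings, and that the dim_num_6 range holds epoch timestamps in milliseconds. The helper name decode_filters is made up for this example.

```python
import base64
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Parameter string copied from the request log on slide 23.
LOGGED_PARAMS = (
    "shard.keys=2349_201!&collection=partialPreaggCollection&wt=javabin&qt=select"
    "&start=0&q=dimListId:201"
    "&fq=dim_num_6:[1383570000000+TO+1383591600000]"
    "&fq=dim_str_0:MjM0OQ%3D%3D"
    "&fq=dim_str_18:(MDgwYTllNzgtNTZkMy00ODhlLWE2NGUtZmM2ODE5NGNkNGNk)"
    "&version=2&rows=99999"
)


def decode_filters(raw_params):
    """Return {field: decoded value} for every fq clause in the logged request."""
    decoded = {}
    # parse_qs turns '+' into a space and '%3D' into '='.
    for clause in parse_qs(raw_params)["fq"]:
        field, _, value = clause.partition(":")
        if field.startswith("dim_str_"):
            # Assumption: string dimensions are Base64-encoded UTF-8.
            decoded[field] = base64.b64decode(value.strip("()")).decode("utf-8")
        elif field.startswith("dim_num_"):
            # Assumption: the numeric range is epoch milliseconds.
            low, high = value.strip("[]").split(" TO ")
            decoded[field] = tuple(
                datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)
                for ms in (low, high)
            )
    return decoded


if __name__ == "__main__":
    for field, value in decode_filters(LOGGED_PARAMS).items():
        print(field, value)
```

Under those assumptions, dim_str_0 decodes to 2349, which matches the prefix of shard.keys=2349_201!, while the 201 after the underscore matches q=dimListId:201. dim_str_18 decodes to a UUID-shaped string, and the dim_num_6 range spans six hours.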
