Solr for Analytics

Aggregating Metrics with Solr at Sematext

Published in: Technology, Education

Transcript

  • 1. Hi :)
  • 2. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." - Dan Ariely, Duke University
  • 3. Solr for Analytics Metrics Aggregations at Sematext Otis Gospodnetić @otisg
  • 4. OG - Pro … since circa 2000 …. since 2007
  • 5. Needle Addiction
  • 6. Watermelon Addiction
  • 7. Bread Love
  • 8. Nuts!
  • 9. Addiction Manifestation #1 Gathering Search Analytics ✓ ✓ ✓ Search Analytics
  • 10. Addiction Manifestation #2 Gathering Search Analytics ✓ ✓ ✓ SPM
  • 11. Addiction Manifestation #3 Gathering Search Analytics ✓ ✓ ✓ Logsene
  • 12. The Surface
  • 13. Under the Hood
  • 14. ?
  • 15. Under the Hood
  • 16. Stripped <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  • 17. Schema <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/> <field name="baseRowKey" type="binary" indexed="false" stored="true" required="true" multiValued="false"/> <field name="dimListId" type="int" indexed="true" stored="true" required="true" multiValued="false"/> <field name="metricsData" type="binary" indexed="false" stored="true" required="true" multiValued="false"/> <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true" /> <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true" /> <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  • 18. Append on Write; Compact Later?
  • 19. Compaction… maybe?
  • 20. Ideally: github.com/sematext/HBaseHUT
  • 21. Merge on Read
  • 22. Merge on Read
  • 23. Metric Retrieval params={shard.keys=2349_201!&collection=partialPreaggCollection&wt=javabin&qt=select&start=0&q=dimListId:201&fq=dim_num_6:[1383570000000+TO+1383591600000]&fq=dim_str_0:MjM0OQ%3D%3D&fq=dim_str_18:(MDgwYTllNzgtNTZkMy00ODhlLWE2NGUtZmM2ODE5NGNkNGNk)&version=2&rows=99999} hits=74 status=0 QTime=41
  • 24. Write Friendly Present: SOLR-4816 - CloudSolrServer; SOLR-5232 - Streaming updates between nodes; SOLR-3249 - Javabin between nodes; SOLR-???? - Aggregating URP
  • 25. 4.4 vs. 4.5: 4.4 - no routing vs. 4.5 - routing via CloudSolrServer; faster AFS flushes; faster indexing: 3.5 M vs. 4 M docs in 55 min; lower heap: circa 10%
  • 26. DocValues <field name="id" type="string" indexed="true" stored="true" required="true" docValues="true" /> <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true" /> <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true" /> Forward index data structure … in memory and/or on disk … compressible … off-heap. See https://cwiki.apache.org/confluence/display/solr/DocValues and http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
  • 27. !DocValues vs. DocValues. With DocValues: … smaller heap … more numerous, but smaller GCs … roughly equal query latency … faster faceting, sorting, and grouping (but we are not using that in SPM with Solr) … didn't see a slowdown in indexing (hmmm)
  • 28. !DocValues vs. DocValues
  • 29. Analytics Friendly Future: LUCENE-5189 - Numeric DocValues Updates; SOLR-2894 - Distributed Pivot Facets; SOLR-5302 - Analytics Component; SOLR-4465 - Configurable Collectors; SOLR-5045 - Pluggable Analytics; SOLR-5027 - Field Collapsing PostFilter
  • 30. FIN. Questions? sematext.com/about/jobs 20% off: LR2013SPM20 /spm /logsene /search-analytics otis@sematext.com @otisg @sematext
  • 31. Filter Collection
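
The request logged on slide 23 (Metric Retrieval) can be reproduced roughly as follows. This is only a sketch in Python, not Sematext's code: the helper names `b64` and `build_metric_query` are hypothetical, and the reading of the log line is an inference. Specifically, it assumes the `dim_str_*` filter values are Base64-encoded strings (the logged `MjM0OQ==` decodes to `2349`), that the `shard.keys` value is the system id and `dimListId` joined by `_` with a trailing `!`, and that `dim_num_6` holds a timestamp in milliseconds.

```python
import base64
from urllib.parse import urlencode


def b64(value: str) -> str:
    """Base64-encode a dimension value (assumed encoding of dim_str_* filters)."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def build_metric_query(system_id, dim_list_id, start_ms, end_ms, extra_dims=None):
    """Build request parameters shaped like the log line on slide 23.

    extra_dims maps a dim_str_* index to its raw (unencoded) value.
    All naming and layout here is inferred from that single log line.
    """
    params = [
        # Assumed routing key layout: "<system id>_<dimListId>!"
        ("shard.keys", f"{system_id}_{dim_list_id}!"),
        ("collection", "partialPreaggCollection"),
        ("q", f"dimListId:{dim_list_id}"),
        # Range filter on what is assumed to be a millisecond timestamp.
        ("fq", f"dim_num_6:[{start_ms} TO {end_ms}]"),
        ("fq", f"dim_str_0:{b64(system_id)}"),
    ]
    for index, value in (extra_dims or {}).items():
        params.append(("fq", f"dim_str_{index}:({b64(value)})"))
    params += [("start", "0"), ("rows", "99999")]
    return params


if __name__ == "__main__":
    query = build_metric_query(
        "2349", 201, 1383570000000, 1383591600000,
        extra_dims={18: "080a9e78-56d3-488e-a64e-fc68194cd4cd"},
    )
    # urlencode percent-escapes the values, so "==" becomes "%3D%3D" as in the log.
    print(urlencode(query))
```

Filtering on the exact, pre-encoded dimension values and routing with `shard.keys` keeps the request on a single shard, which fits the small result set (`hits=74`) and low `QTime` shown in the log line.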